
Build a voice agent

Updated 2026-04-18

A voice agent is an AI that talks with someone on the phone (or in any voice channel). The architecture is surprisingly standard; every voice agent looks like the same pipeline. The interesting parts - what makes one feel natural and another feel robotic - are latency, barge-in, and how you handle the gaps. This guide walks through the full stack.

The pipeline.

~ voice agent pipeline ~

Each hop adds latency. The big rule: under ~1.5 seconds end-to-end feels natural. Over 3 seconds, the caller thinks the line dropped and hangs up. Everything in this guide is about keeping that budget.

Five pieces you'll glue together.

Telephony layer. Accepts inbound calls, streams audio bidirectionally. Twilio Voice, Vonage, or LiveKit are the usual choices.
Speech-to-text (STT). Streaming - must emit text as the caller is still speaking. Deepgram, AssemblyAI, or streaming Whisper.
LLM. Claude handles the conversation. Haiku or Sonnet here - Opus is too slow for real-time. Extended thinking OFF.
Text-to-speech (TTS). Streaming is essential. ElevenLabs, Cartesia, or OpenAI TTS.
Orchestrator. The glue layer you write. Handles barge-in, tool calls, state transitions.
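
To make the hand-offs concrete, here is a minimal sketch of one conversational turn flowing through the streaming stages. Everything in it is a stand-in: `fake_stt`, `fake_llm`, `fake_tts`, and `caller_audio` are hypothetical stubs, not real vendor clients, and a production orchestrator would stream partial transcripts rather than wait for the full utterance.

```python
import asyncio
from typing import AsyncIterator


async def caller_audio() -> AsyncIterator[bytes]:
    """Hypothetical telephony stream: each chunk stands in for an audio frame."""
    for word in (b"check", b"my", b"order"):
        yield word


async def fake_stt(audio: AsyncIterator[bytes]) -> AsyncIterator[str]:
    """Stand-in for a streaming STT client: emits text while audio arrives."""
    async for chunk in audio:
        yield chunk.decode()


async def fake_llm(transcript: str) -> AsyncIterator[str]:
    """Stand-in for a streaming LLM call: yields the reply token by token."""
    for token in f"You said: {transcript}".split():
        yield token


async def fake_tts(tokens: AsyncIterator[str]) -> AsyncIterator[bytes]:
    """Stand-in for streaming TTS: converts each token as soon as it arrives."""
    async for token in tokens:
        yield token.encode()


async def handle_turn() -> list[bytes]:
    """Orchestrator glue for one turn: audio -> text -> reply -> audio."""
    words = [w async for w in fake_stt(caller_audio())]
    transcript = " ".join(words)
    # TTS consumes LLM tokens as they stream, so speech can start early.
    return [frame async for frame in fake_tts(fake_llm(transcript))]


if __name__ == "__main__":
    print(asyncio.run(handle_turn()))
```

The point of the shape is that every stage is an async iterator, so each one can start work before the previous one has finished.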

The latency budget, broken down.

Target: under 1.5 seconds from when the caller stops speaking to when the agent starts replying. Here's how that budget gets spent:

~ 1,500ms budget, where it goes ~

That budget leaves almost no slack. Prompt caching isn't optional - it cuts LLM time-to-first-token by about 40% on long system prompts. Without caching you're roughly 400ms over budget before you start.
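
The arithmetic is easy to check in a few lines. In the sketch below, the per-hop numbers are placeholders chosen only so that the totals match the claims in the text (a 1,500ms budget, a 40% first-token saving, 400ms over without caching); they are not measurements, and the hop names are assumptions. Substitute your own traces.

```python
BUDGET_MS = 1500

# Placeholder per-hop latencies in milliseconds (illustrative, not measured).
OTHER_HOPS_MS = {
    "end_of_speech_detection": 300,
    "stt_final_transcript": 200,
    "tts_first_audio": 250,
    "network_and_telephony": 150,
}

UNCACHED_FIRST_TOKEN_MS = 1000
CACHE_SAVING_PERCENT = 40  # "about 40%" from the text


def cached_first_token_ms(uncached_ms: int, saving_percent: int) -> int:
    """Time-to-first-token after applying the prompt-caching saving."""
    return uncached_ms * (100 - saving_percent) // 100


def over_budget_ms(llm_first_token_ms: int) -> int:
    """Positive means over budget, zero or negative means within budget."""
    total = sum(OTHER_HOPS_MS.values()) + llm_first_token_ms
    return total - BUDGET_MS


if __name__ == "__main__":
    cached = cached_first_token_ms(UNCACHED_FIRST_TOKEN_MS, CACHE_SAVING_PERCENT)
    print("with caching:", over_budget_ms(cached), "ms over")
    print("without caching:", over_budget_ms(UNCACHED_FIRST_TOKEN_MS), "ms over")
```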

The agent is a state machine.

Voice agents live in discrete states. Most of the hard bugs come from sloppy state transitions.

~ five states, driven by events ~
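
One way to keep transitions disciplined is an explicit transition table that rejects anything not listed. The sketch below is an assumption-heavy illustration: the text names only the Listening and Speaking states, so `IDLE`, `THINKING`, and `TRANSFERRING`, along with all event names, are hypothetical choices.

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()          # hypothetical: before the call connects
    LISTENING = auto()     # named in the text
    THINKING = auto()      # hypothetical: waiting on the LLM or a tool
    SPEAKING = auto()      # named in the text
    TRANSFERRING = auto()  # hypothetical: handing off to a human


# (current state, event) -> next state. Anything absent is illegal.
TRANSITIONS = {
    (State.IDLE, "call_connected"): State.LISTENING,
    (State.LISTENING, "caller_stopped"): State.THINKING,
    (State.THINKING, "first_audio_ready"): State.SPEAKING,
    (State.THINKING, "transfer_requested"): State.TRANSFERRING,
    (State.SPEAKING, "playback_done"): State.LISTENING,
    (State.SPEAKING, "barge_in"): State.LISTENING,
}


class InvalidTransition(Exception):
    """Raised when an event arrives in a state that cannot handle it."""


def next_state(state: State, event: str) -> State:
    """Look up the transition; fail loudly instead of drifting silently."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise InvalidTransition(f"{event!r} not allowed in {state.name}") from None
```

Failing loudly on an unlisted transition turns a sloppy state change into an error you see in testing rather than a confusing call in production.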

Barge-in. Why you must always do it.

A voice agent that won't shut up when the caller tries to interrupt is the single most aggravating UX in the world. Every voice agent must implement barge-in.

Mechanics: while in the Speaking state, keep monitoring the caller's audio level. If the caller speaks for more than about 300ms, abort the TTS stream immediately and transition back to Listening. Keep only what the agent had actually said before the interruption in the conversation context - the model can reference it when it speaks again.
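
The detection step can be sketched as a small counter over audio frames. This is a simplified illustration, not a production voice-activity detector: the 20ms frame size, the normalized level scale, and the `level_threshold` value are all assumptions, and only the 300ms figure comes from the text.

```python
from dataclasses import dataclass, field


@dataclass
class BargeInDetector:
    frame_ms: int = 20            # assumed frame duration
    min_speech_ms: int = 300      # from the text: "more than about 300ms"
    level_threshold: float = 0.1  # assumed normalized audio level (0.0 to 1.0)
    _speech_ms: int = field(default=0, init=False, repr=False)

    def feed(self, level: float) -> bool:
        """Feed one frame's audio level; return True when barge-in fires."""
        if level >= self.level_threshold:
            self._speech_ms += self.frame_ms
        else:
            # Any quiet frame resets the run, so coughs and clicks don't count.
            self._speech_ms = 0
        return self._speech_ms > self.min_speech_ms

    def reset(self) -> None:
        """Call after handling a barge-in, before the next Speaking state."""
        self._speech_ms = 0
```

When `feed` returns True, the orchestrator would cancel the TTS stream and move to Listening. With 20ms frames the detector fires on the 16th consecutive loud frame, that is, after 320ms of continuous speech.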

If you skip barge-in, your agent is the phone-tree person who won't stop talking. Callers hate it. Hang-up rate goes up. Don't skip barge-in.

Tool calls without killing the latency budget.

Slow tools ruin voice. A 3-second database lookup is a hang-up. Two tactics:

Pre-fetch on connect. When the caller first connects, fetch the data you'll probably need (caller profile, recent support tickets, account status) in parallel with the greeting. By the time they ask, you have it.
Filler speech during long tool calls. "One moment while I look that up for you..." buys you 2-3 seconds while the tool runs. Pre-record these fillers so they don't themselves incur TTS latency.
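
Both tactics can be sketched with `asyncio`. The helper names (`fetch_profile`, `fetch_tickets`, `play_greeting`, `call_tool_with_filler`) and the `filler_after_s` threshold are hypothetical; the sleeps stand in for real network calls and audio playback.

```python
import asyncio
from typing import Any, Awaitable, Callable, Coroutine


async def fetch_profile(caller_id: str) -> dict:
    await asyncio.sleep(0.01)  # stand-in for a CRM lookup
    return {"caller_id": caller_id}


async def fetch_tickets(caller_id: str) -> list:
    await asyncio.sleep(0.01)  # stand-in for a ticket-system query
    return []


async def play_greeting() -> None:
    await asyncio.sleep(0.01)  # stand-in for streaming the greeting audio


async def on_connect(caller_id: str) -> dict:
    """Pre-fetch likely-needed data in parallel with the greeting."""
    profile, tickets, _ = await asyncio.gather(
        fetch_profile(caller_id), fetch_tickets(caller_id), play_greeting()
    )
    return {"profile": profile, "tickets": tickets}


async def call_tool_with_filler(
    tool: Coroutine[Any, Any, Any],
    play_filler: Callable[[], Awaitable[None]],
    filler_after_s: float = 0.5,
) -> Any:
    """Run a tool; if it is still running after the threshold, play a filler."""
    task = asyncio.create_task(tool)
    done, _pending = await asyncio.wait({task}, timeout=filler_after_s)
    if not done:
        # The tool keeps running in the background while the filler plays.
        await play_filler()
    return await task
```

The threshold matters: a filler played before every tool call makes fast lookups feel slower, so it is only triggered once the tool has already exceeded `filler_after_s`.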

Safety. Voice agents sound authoritative. That's dangerous.

Callers believe what they hear. A voice agent that confidently quotes a wrong price is much worse than a chat agent saying the same thing, because callers don't fact-check voice in real time. Minimum safety must-haves:

Disclose it's an AI in the opening greeting. "Hi, I'm the AI assistant for..." is enough. Don't pretend to be human.
Never quote prices, make commitments, or give medical/legal/financial advice. Transfer to a human for any of those.
Always have a 'transfer to human' path. And advertise it. "If you'd like to speak to a person, just say 'representative.'"
Record and transcribe every call for later review. Keep the recordings and transcripts wherever someone might later need to audit the agent's behavior.
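
The transfer rules above can be backed by a check in the orchestrator rather than left to the prompt alone. The sketch below is deliberately crude keyword matching, and both word lists are hypothetical examples; a real deployment would likely pair this with a classifier or with instructions in the system prompt.

```python
import re

# Hypothetical word lists; tune them for your own domain.
TRANSFER_WORDS = {"representative", "human", "person", "operator"}
RESTRICTED_TOPIC_WORDS = {"price", "pricing", "refund", "diagnosis", "lawsuit"}


def tokenize(utterance: str) -> set[str]:
    """Lowercase the utterance and split it into a set of words."""
    return set(re.findall(r"[a-z']+", utterance.lower()))


def route(utterance: str) -> str:
    """Decide whether the agent may answer or must hand off to a human."""
    words = tokenize(utterance)
    if words & TRANSFER_WORDS:
        return "transfer_requested"
    if words & RESTRICTED_TOPIC_WORDS:
        return "transfer_restricted_topic"
    return "continue"
```

For example, `route("Representative, please")` returns `"transfer_requested"`, while a question about an order status returns `"continue"`.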

How to evaluate a voice agent.

Task success rate. Did the caller actually accomplish what they called for?
Average call duration. Too long = painful. Compare to human baseline.
Transfer rate. Too high = the agent isn't useful. Too low = maybe it's failing to transfer when it should.
Caller satisfaction. A quick post-call survey ("press 1 if this helped") or sentiment analysis of the transcript.
Tool-call latency distribution. Track p50 and p95. Long-tail latency kills user experience.
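
Most of these metrics fall out of per-call logs. The sketch below assumes a hypothetical `CallRecord` shape and uses the nearest-rank definition of a percentile; other definitions interpolate and give slightly different values. Caller satisfaction is omitted because it depends on how the survey or sentiment score is collected.

```python
import math
from dataclasses import dataclass


@dataclass
class CallRecord:  # hypothetical log schema
    task_succeeded: bool
    transferred: bool
    duration_s: float
    tool_latencies_ms: list[float]


def percentile(values: list[float], p: int) -> float:
    """Nearest-rank percentile: the smallest value covering p% of samples."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(calls: list[CallRecord]) -> dict[str, float]:
    """Aggregate call logs (at least one tool call) into headline metrics."""
    n = len(calls)
    latencies = [ms for call in calls for ms in call.tool_latencies_ms]
    return {
        "task_success_rate": sum(c.task_succeeded for c in calls) / n,
        "transfer_rate": sum(c.transferred for c in calls) / n,
        "avg_duration_s": sum(c.duration_s for c in calls) / n,
        "tool_p50_ms": percentile(latencies, 50),
        "tool_p95_ms": percentile(latencies, 95),
    }
```

Reporting p95 alongside p50 is what exposes the long tail: a single slow lookup barely moves the median but dominates the 95th percentile.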

Common pitfalls.

Too robotic. Use natural TTS voices. Vary cadence. Disfluencies ("um", "let me see"), used sparingly, make it feel real.
Won't shut up. Always implement barge-in. Mentioned three times for a reason.
Loses context. Keep conversation history in memory across the call; fetch caller history on connect.
Over-promises. The agent says "I'll take care of that for you" when it actually can't. Fix it with a tight system prompt about what the agent can and can't commit to.

A good voice agent feels like talking to a competent human. A bad one feels like talking to a menu tree that learned to talk. The difference isn't the model; it's the latency budget, the state machine, and the discipline about what it'll commit to.