Build a voice agent
📖 6 min readUpdated 2026-04-18
A voice agent is an agent that talks. Phone calls, voice chat, voice notes. The architecture is standard; the interesting parts are latency and barge-in.
The pipeline
Caller audio → STT → text → LLM → text → TTS → audio back to caller
Each hop adds latency. Under ~1.5 seconds end-to-end feels natural. Over 3 seconds and the caller hangs up.
Architecture at a glance
- Telephony layer. Accepts inbound calls, streams audio. Twilio Voice, Vonage, LiveKit.
- STT. Streaming speech-to-text. Deepgram, AssemblyAI, Whisper (for batch).
- LLM. Claude handles the conversation. Extended thinking OFF (too slow).
- TTS. ElevenLabs, Cartesia, OpenAI TTS. Streaming synthesis is essential.
- Orchestrator. Glues it all together. Handles barge-in, tool calls, state.
Latency budget
Target: <1.5s from end-of-caller-speech to start-of-agent-reply.
- STT finalize: 200ms (streaming)
- LLM first-token: 400–700ms (prompt-cached, Haiku for speed)
- TTS first-audio: 200–400ms
- Network + buffers: 100–200ms
Prompt caching is not optional here, it can cut LLM first-token by 40%.
State machine
Voice agents are state machines. States include:
- Idle, greeting, waiting for caller
- Listening, caller speaking, STT running
- Thinking. LLM generating response, tools may be called
- Speaking. TTS streaming audio back
- Transferring, handing off to a human
Transitions are driven by: silence detection, LLM output, tool results, caller hang-up.
Barge-in
If the caller interrupts while the agent is speaking, the agent must stop immediately. Implementation:
- Monitor caller audio level during Speaking state
- If caller speaks for >300ms, abort the TTS stream
- Transition back to Listening; preserve partial agent utterance in the context
Tool use in voice agents
Slow tools ruin voice. Two tactics:
- Pre-fetch data. When the caller first connects, fetch likely-needed data (caller profile, recent interactions) in parallel with the greeting.
- Filler speech during long tool calls. "Let me look that up for you...", buys 2–3 seconds while the tool runs.
Safety
Voice agents sound authoritative. That's dangerous, callers trust what they hear. Must-haves:
- Disclose it's an AI in the opening greeting
- Never quote prices, promise commitments, or diagnose without handoff
- Have a clear "transfer to human" path on any ambiguity
- Recording + transcription for every call for later review
Evaluating a voice agent
- Task success rate (did the caller accomplish their goal?)
- Avg call duration (too long = painful)
- Transfer rate (too high = agent not useful; too low = maybe missing real handoffs)
- Caller satisfaction (post-call survey or sentiment in transcript)
- Tool call latency distribution
Common pitfalls
- Too robotic. Use natural TTS voices. Vary cadence. Add disfluencies ("Hmm...") sparingly.
- Won't shut up. Always implement barge-in. Always.
- Loses context. Keep conversation history in memory; fetch caller history on connect.
- Over-promises. "I'll get that done for you" when the agent can't actually do it. Constrain the system prompt.