Build a voice agent

A voice agent is an agent that talks. Phone calls, voice chat, voice notes. The architecture is standard; the interesting parts are latency and barge-in.

The pipeline

Caller audio → STT → text → LLM → text → TTS → audio back to caller

Each hop adds latency. Under ~1.5 seconds end-to-end, the conversation feels natural. Over ~3 seconds, callers start hanging up.
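The pipeline above, as an async sketch. Here `stt_stream`, `llm_reply`, and `tts_stream` are hypothetical stand-ins for your vendors, and in production each hop streams into the next rather than waiting for the full turn:

```python
import asyncio

# Hypothetical stand-ins for your STT, LLM, and TTS vendors -- the names
# and behavior here are assumptions, not real SDK calls.
async def stt_stream(audio_chunks):
    """Yield transcript fragments as caller audio arrives."""
    async for chunk in audio_chunks:
        yield chunk.decode()

async def llm_reply(transcript: str) -> str:
    """One conversational turn (extended thinking off for latency)."""
    return f"You said: {transcript}"

async def tts_stream(text: str):
    """Yield synthesized audio frames; streaming TTS can start mid-reply."""
    for word in text.split():
        yield word.encode()

async def handle_turn(audio_chunks) -> list[bytes]:
    """Caller audio -> STT -> LLM -> TTS -> audio frames back to the caller."""
    transcript = "".join([t async for t in stt_stream(audio_chunks)])
    reply = await llm_reply(transcript)
    return [frame async for frame in tts_stream(reply)]

async def fake_caller():
    """Two audio chunks standing in for the telephony stream."""
    for chunk in (b"hello", b" agent"):
        yield chunk

frames = asyncio.run(handle_turn(fake_caller()))
```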

Architecture at a glance

  1. Telephony layer. Accepts inbound calls, streams audio. Twilio Voice, Vonage, LiveKit.
  2. STT. Streaming speech-to-text. Deepgram, AssemblyAI, Whisper (for batch).
  3. LLM. Claude handles the conversation. Extended thinking OFF (too slow).
  4. TTS. ElevenLabs, Cartesia, OpenAI TTS. Streaming synthesis is essential.
  5. Orchestrator. Glues it all together. Handles barge-in, tool calls, state.

Latency budget

Target: <1.5s from end-of-caller-speech to start-of-agent-reply.
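One way to allocate that target across the hops. The per-hop numbers below are illustrative assumptions, not vendor figures:

```python
# Illustrative per-hop budget (milliseconds) summing to the 1.5 s target.
BUDGET_MS = {
    "endpointing (silence detection)": 300,
    "STT final transcript": 200,
    "LLM first token": 500,
    "TTS first audio": 300,
    "network + telephony": 200,
}

total = sum(BUDGET_MS.values())
assert total == 1500  # keep the sum inside the end-to-end target
```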

Prompt caching is not optional here: it can cut LLM time-to-first-token by 40%.
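The shape of a Messages API request with caching on the static system prompt: the long instructions get cached, so only the short per-turn transcript is processed fresh. The model id and prompt text are examples; sending the request needs the anthropic SDK and an API key, omitted here:

```python
# Request payload with prompt caching on the static system prompt.
# "claude-sonnet-4-20250514" is an example model id.
request = {
    "model": "claude-sonnet-4-20250514",
    "max_tokens": 300,                    # short replies keep TTS latency down
    "system": [
        {
            "type": "text",
            "text": "You are a phone agent for Acme Support. ...",  # long, static
            "cache_control": {"type": "ephemeral"},  # cache boundary goes here
        }
    ],
    "messages": [
        {"role": "user", "content": "Hi, I need to check my order status."}
    ],
}
```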

State machine

Voice agents are state machines. States include (a typical minimal set):

  1. Listening. Caller audio streams to STT.
  2. Thinking. Waiting on the LLM.
  3. Speaking. TTS audio plays out to the caller.
  4. Running a tool. A tool call is in flight.
  5. Ended. Caller hung up.

Transitions are driven by: silence detection, LLM output, tool results, caller hang-up.
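A minimal sketch of those transitions; the state and event names are assumptions, chosen to mirror the drivers listed above:

```python
from enum import Enum, auto

class State(Enum):
    LISTENING = auto()   # caller is speaking; STT is streaming
    THINKING = auto()    # waiting on the LLM
    SPEAKING = auto()    # TTS audio is playing out
    TOOL = auto()        # a tool call is in flight
    ENDED = auto()       # caller hung up

# (state, event) -> next state
TRANSITIONS = {
    (State.LISTENING, "silence"): State.THINKING,
    (State.THINKING, "llm_text"): State.SPEAKING,
    (State.THINKING, "llm_tool_call"): State.TOOL,
    (State.TOOL, "tool_result"): State.THINKING,
    (State.SPEAKING, "playback_done"): State.LISTENING,
    (State.SPEAKING, "barge_in"): State.LISTENING,
}

def step(state: State, event: str) -> State:
    if event == "hangup":            # hang-up ends the call from any state
        return State.ENDED
    return TRANSITIONS[(state, event)]
```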

Barge-in

If the caller interrupts while the agent is speaking, the agent must stop immediately. Implementation (the usual moving parts):

  1. Run VAD (voice activity detection) on inbound audio even while the agent is speaking.
  2. On detected speech, stop TTS playback and flush any audio buffered at the telephony layer.
  3. Cancel the in-flight LLM and TTS generation so stale text never plays.
  4. Transition back to listening and feed the interruption to STT.
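The stop itself is usually task cancellation. A minimal sketch with asyncio, where `speak` is a stand-in for TTS playout and the cancel fires when VAD detects caller speech (names and timings are assumptions):

```python
import asyncio

async def speak(n_frames: int) -> None:
    """Play TTS frames one at a time; cancelling the task stops playback."""
    for _ in range(n_frames):
        await asyncio.sleep(0.01)      # stand-in for pacing one audio frame

async def call_loop() -> str:
    speaking = asyncio.create_task(speak(100))   # a long agent utterance
    await asyncio.sleep(0.03)        # ~30 ms in, VAD detects caller speech
    speaking.cancel()                # barge-in: stop playback immediately
    try:
        await speaking
    except asyncio.CancelledError:
        pass                         # playback flushed, not finished
    return "LISTENING"               # hand the turn back to the caller

state = asyncio.run(call_loop())
```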

Tool use in voice agents

Slow tools ruin voice. Two tactics:

  1. Pre-fetch data. When the caller first connects, fetch likely-needed data (caller profile, recent interactions) in parallel with the greeting.
  2. Filler speech during long tool calls. "Let me look that up for you..." buys 2–3 seconds while the tool runs.
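Tactic 1 is a few lines of concurrency. Here `fetch_profile` and `play_greeting` are hypothetical helpers standing in for a CRM lookup and the greeting playout:

```python
import asyncio

# Hypothetical stand-ins: a CRM lookup and the greeting audio playout.
async def fetch_profile(caller_id: str) -> dict:
    await asyncio.sleep(0.05)            # simulated CRM latency
    return {"caller_id": caller_id, "recent_orders": 2}

async def play_greeting() -> None:
    await asyncio.sleep(0.05)            # the greeting takes time anyway

async def on_connect(caller_id: str) -> dict:
    # Fetch likely-needed data while the greeting plays, not after it.
    profile, _ = await asyncio.gather(fetch_profile(caller_id), play_greeting())
    return profile

profile = asyncio.run(on_connect("+15550100"))
```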

Safety

Voice agents sound authoritative. That's dangerous: callers trust what they hear. Must-haves:

  1. Disclose up front that the caller is talking to an AI.
  2. Confirm before any irreversible action (payments, cancellations, data changes).
  3. Offer a clear path to a human, and hand off on request or repeated failure.
  4. Never read back sensitive data without verifying the caller's identity.

Evaluating a voice agent

Measure on recorded real calls, not text transcripts. Useful signals: end-to-end latency per turn, transcription accuracy on your domain vocabulary, barge-in response time, and task completion rate.

Common pitfalls

Testing in text and shipping to voice (latency and barge-in problems only surface with real audio), replies written for reading rather than listening, extended thinking left on, and evaluating STT on clean audio instead of telephony-grade 8 kHz.