How to Set Up an AI Voice Agent From Scratch
A practical guide to building an AI voice agent using platforms like Vapi, Retell, and LiveKit - covering architecture, setup steps, and cost estimates.

AI voice agents are no longer a concept demo at a tech conference. They're answering real phone calls right now - booking dental appointments, handling insurance claims, qualifying sales leads, and running entire customer support lines. If you've called a business recently and had a surprisingly natural conversation with what turned out to be AI, you've already met one.
This guide walks you through how voice agents actually work, which platforms you can use to build one, and how to go from zero to a working voice agent you can test with a real phone call. No machine learning PhD required.
TL;DR
- Voice agents use a three-step pipeline: Speech-to-Text (STT) converts your words to text, a Large Language Model (LLM) generates a response, and Text-to-Speech (TTS) reads it back.
- Managed platforms like Vapi, Retell AI, and Bland.ai let you build a working agent in under an hour. Expect to pay roughly $0.06 - $0.30 per minute all-in depending on your setup.
- Open-source options like LiveKit Agents and Vocode give you full control but require more technical effort.
- The biggest challenge is latency. Users notice awkward pauses beyond one second, so every millisecond in the pipeline matters.
How Voice Agents Work
At their core, voice agents run on a three-stage pipeline. Understanding this chain helps you make smarter decisions when choosing tools and providers.
Stage 1 - Speech-to-Text (STT). The agent listens to audio input from a phone call or microphone and converts it to text. Popular STT providers include Deepgram, AssemblyAI, and Google Cloud Speech. This step normally takes 100 - 500 milliseconds.
Stage 2 - Large Language Model (LLM). The transcribed text gets fed to an LLM - think GPT-4o, Claude, or Gemini - that understands the request and generates a text response. You control its behavior through a system prompt, just like you would with any chatbot. This step is usually the slowest, ranging from 200ms to 2,000ms depending on the model and response length.
Stage 3 - Text-to-Speech (TTS). The LLM's text response gets converted to natural-sounding speech and played back to the caller. Providers like ElevenLabs, PlayHT, and Cartesia handle this in 200 - 800ms.
Add it all up, and a typical round-trip takes somewhere between 500ms and 3 seconds. The best production systems hit sub-second latency by streaming each stage - the TTS starts speaking before the LLM has finished generating, and the LLM starts processing before the STT has the complete utterance.
There's a newer approach worth mentioning: speech-to-speech models like OpenAI's Realtime API skip the STT and TTS steps completely, processing audio in and audio out. They're faster but less flexible - you can't easily swap components or inspect what's happening between stages. Most production voice agents still run the three-stage pipeline because it's easier to debug, customize, and control costs.
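The three-stage pipeline boils down to plain function composition. Here's a minimal sketch with stub functions standing in for real providers - the return values are hard-coded placeholders, not actual SDK calls, so you can see the data flow without any API keys:

```python
def speech_to_text(audio_chunk: bytes) -> str:
    # Stage 1: placeholder for a real STT call (e.g. Deepgram).
    return "I'd like to book an appointment"

def generate_reply(transcript: str, system_prompt: str) -> str:
    # Stage 2: placeholder for an LLM call (e.g. GPT-4o).
    return "Sure - what day works best for you?"

def text_to_speech(reply: str) -> bytes:
    # Stage 3: placeholder for a TTS call (e.g. ElevenLabs).
    return reply.encode("utf-8")  # pretend this is synthesized audio

def handle_turn(audio_chunk: bytes, system_prompt: str) -> bytes:
    """One conversational round-trip: caller audio in, agent audio out."""
    transcript = speech_to_text(audio_chunk)
    reply = generate_reply(transcript, system_prompt)
    return text_to_speech(reply)

audio_out = handle_turn(b"...caller audio...", "You are a scheduler.")
```

Real systems stream partial results between these stages rather than passing complete values, but the shape of the data flow is the same.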
If you're new to AI agents in general, the concept is the same - an AI that can take actions autonomously. Voice agents just happen to do it over a phone call.
Platform Options
You don't have to build the pipeline yourself. Several platforms handle the orchestration, telephony, and infrastructure so you can focus on the conversation design. Here's how the major options compare:
| Platform | Type | Starting Cost | Key Strength | Best For |
|---|---|---|---|---|
| Vapi | Managed | ~$0.05/min + provider costs | Maximum flexibility, swap any component | Developers who want fine-grained control |
| Retell AI | Managed | ~$0.07/min all-in | Drag-and-drop builder, sub-600ms latency | Teams that want a visual builder |
| Bland.ai | Managed | ~$0.11 - $0.14/min | Simple API, fast to deploy | Outbound sales and cold calling |
| LiveKit Agents | Open-source + cloud | Free self-hosted / cloud pricing varies | Battle-tested (powers ChatGPT Voice) | Developers who want open-source foundations |
| Vocode | Open-source | Free | Highly composable Python library | Developers who want full code control |
| Twilio ConversationRelay | Telephony + AI | Twilio calling rates + LLM costs | Native phone infrastructure | Businesses already on Twilio |
A note on pricing: the per-minute costs above can be misleading. Vapi's $0.05/min is just the platform fee - you still pay separately for STT, LLM, and TTS providers. In practice, a fully assembled Vapi agent typically costs $0.15 - $0.30 per minute. Retell bundles more into their base price. Bland.ai's rate includes most components but charges extra for phone numbers ($15/month) and SMS.
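To see how a "bring your own providers" price assembles, here's a quick back-of-the-envelope sum. The component rates below are illustrative mid-range figures, not quotes from any provider's price sheet:

```python
# Illustrative per-minute rates in USD - check each provider's
# current pricing page before budgeting.
vapi_platform = 0.05    # orchestration fee
stt_rate = 0.015        # e.g. a Deepgram-class STT service
llm_rate = 0.06         # a frontier LLM at typical voice-turn token volumes
tts_rate = 0.10         # a premium TTS voice

total_per_minute = vapi_platform + stt_rate + llm_rate + tts_rate
print(round(total_per_minute, 3))  # ~0.225/min, inside the $0.15-$0.30 range
```

Swapping the premium TTS voice for a cheaper one, or the frontier LLM for a small model, is how teams land at the bottom of that range.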
Step-by-Step: Build a Voice Agent With Vapi
Vapi is a good starting point because it gives you a clear view of every component in the pipeline. You'll understand exactly what's happening under the hood. This walkthrough gets you to a working test call in about 30 minutes.
1. Create an Account and Add API Keys
Sign up at vapi.ai and grab a free trial (most platforms offer some free minutes to start). You'll need API keys from your chosen providers:
- STT: Create a Deepgram account (they offer a generous free tier)
- LLM: Get an API key from OpenAI, Anthropic, or whichever model provider you prefer
- TTS: Sign up for ElevenLabs or PlayHT for high-quality voices
Add these keys in Vapi's dashboard under the Provider Keys section.
2. Create Your Assistant
In the Vapi dashboard, create a new assistant. This is where you define the agent's personality and behavior. The three main settings to configure:
Voice selection. Pick a TTS provider and voice. ElevenLabs voices sound the most natural but cost more. For testing, the built-in options work fine.
Model selection. Choose your LLM. GPT-4o is a solid default - it's fast and capable. Claude works well too, especially for nuanced conversations. For cost-sensitive use cases, consider GPT-4o-mini or open-source models through providers like Groq.
System prompt. This is where you define who the agent is and how it should behave. A good system prompt for a voice agent is different from a chatbot prompt. Keep it concise - for example:

```
You are a friendly appointment scheduler for Greenfield Dental Clinic.
Rules:
- Keep responses short (1-2 sentences). People don't want to listen to paragraphs.
- Available appointment slots: weekdays 9am-5pm.
- Collect: patient name, preferred date/time, reason for visit.
- If a slot is unavailable, suggest the nearest alternative.
- Confirm all details before ending the call.
Tone: warm, professional, conversational. Use contractions.
```
Short responses are essential for voice. A response that reads fine on screen feels painfully long when spoken aloud.
3. Configure Telephony
To receive actual phone calls, you'll need a phone number. Vapi can provision one for you directly, or you can connect a Twilio number via SIP (Session Initiation Protocol - the standard for internet phone calls). Twilio numbers cost about $1.15/month for a US number plus per-minute calling rates.
For initial testing, skip the phone number completely and use Vapi's built-in web call feature to test through your browser.
4. Test Your Agent
Run a test call from the dashboard. Talk to your agent like a real caller would. Pay attention to:
- Latency: Does the agent respond fast enough to feel natural? Target under one second.
- Interruption handling: Can you cut in mid-sentence? Good agents handle "barge-in" gracefully.
- Accuracy: Does it collect the right information and follow your system prompt?
- Conversation flow: Does it feel like talking to a helpful person, or a menu system?
Iterate on your system prompt based on these test calls. Most of the tuning work in voice agents is prompt engineering, not code.
5. Deploy
Once you're happy with testing, assign a phone number and you're live. Vapi provides webhooks so you can send collected data (appointments, lead info, support tickets) to your CRM, calendar, or database.
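A webhook receiver can be as simple as a function that parses the end-of-call payload and extracts what you want to forward to your CRM. The field names below (`type`, `customer.number`, `summary`, `transcript`) are illustrative assumptions - check your platform's webhook documentation for the real schema:

```python
import json

def handle_call_webhook(raw_body: str) -> dict:
    """Parse an end-of-call webhook and extract the fields worth storing.

    The payload shape here is a made-up example, not Vapi's actual
    schema - consult the platform docs for the real event format.
    """
    event = json.loads(raw_body)
    if event.get("type") != "end-of-call-report":
        return {}  # ignore other event types
    return {
        "caller": event.get("customer", {}).get("number"),
        "summary": event.get("summary"),
        "transcript": event.get("transcript"),
    }

sample = json.dumps({
    "type": "end-of-call-report",
    "customer": {"number": "+15551234567"},
    "summary": "Booked a cleaning for Tuesday at 10am.",
    "transcript": "...",
})
print(handle_call_webhook(sample)["summary"])
```

In production this function would sit behind an HTTP endpoint (and you should verify the webhook's signature before trusting the payload), but the parsing logic is the core of it.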
The Open-Source Route
If you'd rather own the full stack - or you need to keep voice data on your own servers for compliance reasons - open-source frameworks are mature enough for production use.
LiveKit Agents
LiveKit is probably the strongest open-source option right now. It's the same infrastructure that powers ChatGPT's Advanced Voice mode, serving millions of conversations daily. The Agents framework provides Python and Node.js SDKs with built-in plugins for most major STT, LLM, and TTS providers.
A minimal LiveKit voice agent in Python looks roughly like this:
```python
from livekit.agents import AutoSubscribe, JobContext, WorkerOptions, cli
from livekit.agents.voice import VoicePipelineAgent
from livekit.plugins import deepgram, openai, silero

async def entrypoint(ctx: JobContext):
    await ctx.connect(auto_subscribe=AutoSubscribe.AUDIO_ONLY)
    agent = VoicePipelineAgent(
        vad=silero.VAD.load(),  # voice activity detection for turn-taking
        stt=deepgram.STT(),     # speech-to-text
        llm=openai.LLM(),       # response generation
        tts=openai.TTS(),       # text-to-speech
    )
    agent.start(ctx.room)

if __name__ == "__main__":
    cli.run_app(WorkerOptions(entrypoint_fnc=entrypoint))
```
You still need API keys for the individual providers (Deepgram, OpenAI, etc.), but LiveKit handles the orchestration, turn detection, and interruption handling - the hard parts of real-time voice. You can self-host the LiveKit server or use their managed cloud.
Vocode
Vocode takes a more modular approach. It's a Python library from a Y Combinator-backed startup that handles the real-time pipeline plumbing - capturing audio, routing it through STT, managing the LLM conversation, and playing back TTS. It supports Twilio and other telephony providers out of the box, so you can connect to real phone numbers.
Vocode shines when you need deep customization of the conversation flow, like building agents that can transfer calls, handle multiple intents in a single conversation, or integrate with custom business logic.
For a comparison of the broader agent framework options, we have a separate guide.
Cost Breakdown
Voice agent costs are driven by three components, all billed by the minute (or by tokens, in the LLM's case). Here's what to budget:
| Component | Typical Cost | Example Provider |
|---|---|---|
| STT | $0.01 - $0.04/min | Deepgram, AssemblyAI |
| LLM | $0.02 - $0.10/min | GPT-4o, Claude, GPT-4o-mini |
| TTS | $0.02 - $0.08/min | ElevenLabs, PlayHT, Cartesia |
| Platform/orchestration | $0.00 - $0.05/min | Vapi, Retell (or $0 if self-hosted) |
| Telephony | $0.01 - $0.03/min | Twilio, Telnyx |
| Total | $0.06 - $0.30/min | |
For a small business handling 500 minutes of calls per month, that works out to roughly $30 - $150/month. A contact center handling 50,000 minutes might spend $3,000 - $15,000/month - still potentially much cheaper than human agents at $15 - $25/hour.
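Those monthly figures are just minutes multiplied by the sum of the per-minute components. A small estimator makes it easy to plug in your own rates - the defaults below are mid-range values from the table above, not any provider's actual pricing:

```python
def monthly_cost(minutes: int, stt: float = 0.02, llm: float = 0.05,
                 tts: float = 0.04, platform: float = 0.03,
                 telephony: float = 0.02) -> float:
    """Estimate monthly spend (USD) from per-minute component rates.

    Default rates are illustrative mid-range figures; substitute your
    providers' real pricing.
    """
    per_minute = stt + llm + tts + platform + telephony
    return round(minutes * per_minute, 2)

print(monthly_cost(500))     # small business: $80/month at mid-range rates
print(monthly_cost(50_000))  # contact center: $8,000/month at the same rates
```

Setting `platform=0` models the self-hosted route, and dropping `llm` to around $0.01/min approximates switching to a small model.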
The biggest cost lever is your LLM choice. GPT-4o-mini or Claude Haiku can cut your LLM cost by 80 - 90% compared to full-size models, and for many scripted use cases (appointment booking, FAQ handling), they perform just as well.
Want to hear how different AI voice generators compare on quality? Check our roundup for audio samples and rankings.
Common Challenges
Building the voice agent is the easy part. Making it work well in production is where the real effort goes.
Latency is the enemy. Anything above one second of silence feels unnatural. Users start saying "hello?" or hang up. To keep latency low: pick fast models (GPT-4o-mini, Groq-hosted open models), use streaming everywhere, choose STT and TTS providers with servers close to your users, and keep your system prompts short. Every extra token the LLM has to process adds delay.
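To see why streaming matters so much, compare time-to-first-audio for a sequential pipeline against an overlapped one. The millisecond figures below are illustrative, not measurements, and the model is simplified (real systems also spend time on endpointing and network hops):

```python
# Illustrative full-stage latencies in milliseconds.
STT_TOTAL, LLM_TOTAL, TTS_TOTAL = 300, 800, 400

# With streaming, each stage hands off its first partial output instead
# of waiting to finish:
STT_ENDPOINT = 150     # detect end of utterance, finalize transcript
LLM_FIRST_TOKEN = 250  # LLM time-to-first-token
TTS_FIRST_AUDIO = 150  # TTS time-to-first-audio-byte

# Sequential: each stage waits for the previous one to fully complete.
sequential = STT_TOTAL + LLM_TOTAL + TTS_TOTAL
# Streamed: the caller hears audio as soon as the first chunk is ready.
streamed = STT_ENDPOINT + LLM_FIRST_TOKEN + TTS_FIRST_AUDIO

print(f"sequential time-to-first-audio: {sequential} ms")  # over the 1s target
print(f"streamed time-to-first-audio: {streamed} ms")      # well under it
```

The same total work happens either way; streaming just moves the moment the caller starts hearing a response much earlier.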
Interruptions are tricky. Real conversations have interruptions constantly. Someone says "actually, wait" mid-sentence. Good voice agent platforms include Voice Activity Detection (VAD) to handle barge-in - detecting when the user starts speaking and gracefully stopping the agent's output. If your platform doesn't handle this well, conversations feel robotic. Test this early.
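The barge-in logic itself is simple once you have a VAD signal. Production systems use trained models like Silero or WebRTC VAD, but a toy energy-threshold version shows the idea - the threshold and frame values here are made up for illustration:

```python
def is_speech(frame: list[int], threshold: int = 500) -> bool:
    """Naive energy-based VAD on one frame of 16-bit audio samples.

    Real VADs use trained models; this just checks average amplitude.
    """
    energy = sum(abs(s) for s in frame) / len(frame)
    return energy > threshold

def should_barge_in(frames: list[list[int]], agent_speaking: bool) -> bool:
    # Require a couple of speech frames so a cough or burst of background
    # noise doesn't cut the agent off mid-sentence.
    speech_frames = sum(1 for f in frames if is_speech(f))
    return agent_speaking and speech_frames >= 2

silence = [10, -12, 8, -9] * 40
speech = [4000, -3800, 4200, -3900] * 40
print(should_barge_in([silence, speech, speech], agent_speaking=True))
```

When `should_barge_in` fires, the agent stops TTS playback, flushes any queued audio, and hands the turn back to the caller.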
Hallucination hits different in voice. When a chatbot hallucinates, users can re-read the text and catch the error. When a voice agent confidently states wrong information at speaking pace, the caller takes it as fact. Keep your agent's scope narrow, ground it with function calls to real data sources, and be explicit in your system prompt about saying "I don't know" rather than guessing.
Accents and background noise. STT accuracy drops in noisy environments and with strong accents. Deepgram and AssemblyAI have improved a lot here, but test with your actual user base. If you're serving a multilingual audience, verify STT and TTS support for each language - not all providers cover the same languages. For the latest on how speech models compare, see our voice and speech leaderboard.
Phone audio quality. Phone calls compress audio heavily compared to a studio microphone. Test your agent over actual phone lines, not just browser audio. The difference in STT accuracy can be significant.
Where to Go From Here
Start with a managed platform like Vapi or Retell to get something working fast. You can always migrate to self-hosted LiveKit later if you need more control or lower costs at scale. The important thing is to ship a working prototype and test it with real callers - you'll learn more from 10 real calls than from weeks of configuration tweaking.
Voice agents have gotten surprisingly good in 2025 and 2026, and the costs keep dropping. If you have a use case that involves repetitive phone conversations - scheduling, FAQs, lead qualification, order status checks - there's very little reason not to test one.
Last verified March 9, 2026
