
Belva AI: Building Voice Calling AI Agent

By Xavier Collantes

8/20/2025


Years ago, Google demoed Duplex, a voice AI agent that could make phone calls and book appointments. When I first started working at Belva AI, I was tasked with working on the engineering team that would build Belva's competitor to Duplex.
This was one of the first of many products I helped develop at Belva, built entirely from scratch using Python services, websockets, and a series of AI models for speech recognition, natural language understanding, and voice synthesis.
Site at the time: belva.ai

Since this project, Belva has pivoted its product line, but building a voice AI agent from scratch was a great learning experience.

Requirements

Creating a voice AI agent that could:
  1. Handle natural conversations over phone calls
  2. Make reservations and schedule appointments
  3. Process real-time audio streams
  4. Maintain conversation context throughout calls
  5. Log calls for quality assurance with result and transcript summary

Challenges


2-Way Real-Time Communication

'Full-Duplex' is a 2-way real-time communication system where both parties can talk at the same time.

To meet the requirements, we needed a Full-Duplex communication system, since it best mimics a human-to-human conversation in which the two parties may interject at the same time.
```txt
AI: What time are you available?
Human: Uh, I'm not sure.
AI: That's--
Human: **interrupting** actually 6pm would be good
AI: 6pm, I'll make a note
```
On the backend, the Python code ran multiple threads to handle the communication between the AI and the human. The challenge was managing the threads and keeping the communication seamless within acceptable latency.
We ran one thread for receiving and one for sending: the receive thread listened for the human's response and passed it to the LLM.
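The two-thread pattern can be sketched as below. This is a hypothetical sketch, not Belva's actual code: `run_duplex`, `handle_utterance`, and the hang-up sentinel are illustrative names, and in production the queues were fed by a websocket carrying audio rather than pre-loaded strings.

```python
import queue
import threading


def run_duplex(incoming: queue.Queue, handle_utterance) -> list:
    """Receive utterances on one thread while replies drain on another."""
    replies: queue.Queue = queue.Queue()
    stop = threading.Event()
    sent = []

    def receiver() -> None:
        # Listen for the human's transcribed speech and hand it to the LLM.
        while not stop.is_set():
            utterance = incoming.get()
            if utterance is None:  # Sentinel: the caller hung up.
                stop.set()
                replies.put(None)
                break
            replies.put(handle_utterance(utterance))

    def sender() -> None:
        # Stand-in for streaming synthesized audio back to the caller.
        while True:
            reply = replies.get()
            if reply is None:
                break
            sent.append(reply)

    threads = [threading.Thread(target=receiver), threading.Thread(target=sender)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sent
```

Feeding the incoming queue a few utterances followed by a `None` hang-up sentinel returns the ordered list of replies, with receiving and sending overlapping on separate threads.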

Human-like

Voice calling has unique challenges because of the nature of human-to-human speech. People speak differently than how they type. A person may say "Um, uh, I think, yeah that'd be good." The hardest part was handling pauses: the AI would start responding as soon as audio bytes stopped arriving, so a natural mid-sentence pause could trigger a premature reply. We solved this by tuning the silence timeout through testing.
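The silence-timeout idea can be sketched as follows. This is a minimal illustration, not production code: an utterance counts as finished only once no audio chunks have arrived for `silence_sec`, and that threshold is exactly the kind of value we arrived at by testing.

```python
def detect_end_of_utterance(chunk_times: list, silence_sec: float = 0.8) -> float:
    """Return the moment the AI would start replying.

    chunk_times: sorted arrival times (in seconds) of non-silent audio chunks.
    """
    for prev, cur in zip(chunk_times, chunk_times[1:]):
        if cur - prev > silence_sec:
            # The gap exceeded the timeout: the reply fires mid-pause,
            # possibly cutting off a speaker who was only thinking.
            return prev + silence_sec
    # No long gap: the speaker is done one timeout after the last chunk.
    return chunk_times[-1] + silence_sec
```

With `silence_sec=1`, chunks arriving at seconds 0, 1, and 2 end the utterance at second 3, while chunks at seconds 0, 1, and 5 fire the reply at second 2, in the middle of the pause. Too short a timeout interrupts the caller; too long feels laggy.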

Context Retention

At the time of this project, LLMs had a limited context window. This meant that the agent would forget what was said earlier in the conversation.
We solved this by using a parallel LLM to summarize the conversation and keep the context: a kind of compressor that reduced the text size while retaining the main ideas.
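The compressor idea can be sketched like this, with `summarize` standing in for the parallel LLM call; the turn budgets are illustrative numbers, not the values we actually used.

```python
def compress_history(turns: list, summarize,
                     max_turns: int = 6, keep_recent: int = 3) -> list:
    """Collapse older turns into one summary turn once history grows."""
    if len(turns) <= max_turns:
        return turns
    older, recent = turns[:-keep_recent], turns[-keep_recent:]
    # The summary keeps the main ideas while shrinking the token count,
    # so the most recent turns stay verbatim for the next LLM call.
    return [("system", f"Summary of earlier conversation: {summarize(older)}")] + recent
```

Running this before each LLM call keeps the prompt inside the context window: once the transcript passes `max_turns`, everything except the last few turns is replaced by a single summary message.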

Key Learnings

This experience sharpened my skills in:
  • Python
  • WebSockets
  • LLMs (ChatGPT, Llama, Gemini, Sonnet)
  • Speech Recognition (Deepgram)
  • Voice Synthesis (11Labs)
  • Infrastructure (Docker, Kubernetes, Redis, MongoDB, AWS)
  • API Design (FastAPI, Networking)

The Future

At the time, we were building the agent from scratch. Today there are many out-of-the-box options such as Vapi and Cartesia. But the opportunity to build a complex system from scratch was a great learning experience because it challenged our team's creativity and problem-solving skills.
