Years ago, Google demoed Duplex, a voice AI agent that could make phone calls and book appointments. When I first started working at Belva AI, I joined the engineering team that would build Belva's competitor to Duplex.
This was one of the first of many products I helped develop at Belva, built entirely from scratch using Python services, websockets, and a series of AI models for speech recognition, natural language understanding, and voice synthesis. Belva has since pivoted its products, but building a voice AI agent from the ground up taught me a great deal.
Requirements
We set out to create a voice AI agent that could:
- Handle natural conversations over phone calls
- Make reservations and schedule appointments
- Process real-time audio streams
- Maintain conversation context throughout calls
- Log calls for quality assurance, with a result and a transcript summary
Challenges
2-Way Real-Time Communication
'Full-duplex' is a two-way, real-time communication system in which both parties can talk at the same time.
To meet the requirements, we needed a full-duplex communication system, because it best mimics a human-to-human conversation, where the two parties may interject at the same time. For example:
```txt
AI: What time are you available?
Human: Uh, I'm not sure.
AI: That's--
Human: **interrupting** actually 6pm would be good
AI: 6pm, I'll make a note
```
On the backend, the Python code ran multiple threads to handle the communication between the AI and the human. The challenge was managing those threads and keeping the exchange seamless within acceptable latency. We ran one thread for receiving and one for sending; the receive thread listened for the human's speech and passed it to the LLM.
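Here is a minimal sketch of that two-thread layout, using in-process queues in place of the actual websocket transport. `handle_caller_audio` and `play_to_caller` are hypothetical stand-ins for the recognition/LLM handoff and audio playback, not the production functions:

```python
import queue
import threading

# In-process queues stand in for the websocket audio streams;
# all names here are illustrative, not the production code.
incoming_audio = queue.Queue()  # bytes arriving from the caller
outgoing_audio = queue.Queue()  # synthesized speech to play back
stop = threading.Event()

def handle_caller_audio(chunk):
    # Placeholder for the real pipeline: speech recognition -> LLM.
    print(f"received {len(chunk)} bytes from caller")

def play_to_caller(chunk):
    # Placeholder for writing synthesized audio back onto the call.
    print(f"sent {len(chunk)} bytes to caller")

def receive_loop():
    """One thread listens for the human's audio and forwards it."""
    while not stop.is_set():
        try:
            chunk = incoming_audio.get(timeout=0.1)
        except queue.Empty:
            continue
        handle_caller_audio(chunk)

def send_loop():
    """A second thread streams the AI's reply, so both directions run at once."""
    while not stop.is_set():
        try:
            chunk = outgoing_audio.get(timeout=0.1)
        except queue.Empty:
            continue
        play_to_caller(chunk)

threading.Thread(target=receive_loop, daemon=True).start()
threading.Thread(target=send_loop, daemon=True).start()
```

Because each direction runs on its own thread, the agent can keep speaking while still hearing the caller, which is what makes interruptions like the one above possible.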
Human-like
Voice calling has unique challenges because of the nature of human-to-human speech. People speak differently than they type; a person may say, "Um, uh, I think, yeah that'd be good." The hardest part was accounting for pauses, because the AI's response would begin as soon as audio bytes stopped arriving. We solved this by tuning a silence timeout through repeated testing.
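In rough terms, the endpointing logic looked like the sketch below. The timeout value and the `next_chunk` callable are illustrative assumptions, not the production code:

```python
import time

SILENCE_TIMEOUT = 0.7  # seconds of quiet before replying; value is illustrative

def wait_for_end_of_utterance(next_chunk):
    """Buffer caller audio until the line has been quiet for SILENCE_TIMEOUT.

    `next_chunk` is a hypothetical callable that returns the next chunk of
    audio bytes from the transport, or None when nothing has arrived.
    """
    buffered = bytearray()
    last_audio = time.monotonic()
    while True:
        chunk = next_chunk()
        if chunk:
            buffered.extend(chunk)
            last_audio = time.monotonic()
        elif time.monotonic() - last_audio > SILENCE_TIMEOUT:
            # The caller has likely finished speaking; hand off to the LLM.
            return bytes(buffered)
        else:
            time.sleep(0.01)  # brief wait to avoid spinning
```

Too short a timeout and the agent talks over mid-sentence pauses; too long and the call feels laggy, which is why the value had to be found by testing.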
Context Retention
At the time of this project, LLMs had limited context windows, which meant the agent would forget what was said earlier in the conversation.
We solved this by using a parallel LLM to summarize the conversation and keep the context: a kind of compressor that reduced the text size while retaining the main ideas.
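A minimal sketch of that rolling-summary compressor might look like the following, where `llm` is a hypothetical function wrapping whatever completion API is in use, and `MAX_TURNS` is an illustrative cutoff:

```python
MAX_TURNS = 8  # keep only the most recent turns verbatim; cutoff is illustrative

def compress_history(summary, turns, llm):
    """Fold older turns into a running summary so the prompt stays small."""
    if len(turns) <= MAX_TURNS:
        return summary, turns
    older, recent = turns[:-MAX_TURNS], turns[-MAX_TURNS:]
    prompt = (
        "Summarize the conversation so far, keeping names, times, and "
        "decisions.\n"
        f"Previous summary: {summary}\n"
        "New turns:\n" + "\n".join(older)
    )
    return llm(prompt), recent
```

Before each request to the main model, the prompt is assembled from the running summary plus the recent verbatim turns, so the total size stays roughly constant no matter how long the call runs.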
At the time, we were building the agent from scratch. Today there are many out-of-the-box options, such as Vapi and Cartesia, but the opportunity to build a complex system from scratch was a great learning experience because it challenged our team's creativity and problem-solving skills.