Voice AI Technology Stack Explained for Agency Owners

A voice AI technology stack is five layers working in sequence: telephony (how the call connects), speech-to-text (how the AI hears), NLU/LLM (how the AI understands and decides what to say), text-to-speech (how the AI speaks), and orchestration (how the call flows between actions like booking, transferring, or taking a message). Each layer adds latency. A native voice AI platform like Trillet runs all five layers on its own infrastructure to keep total response latency low (typically in the 1.5 to 2.5 second range end to end). Wrapper platforms stack additional layers on top of third-party providers like Vapi or Retell, which means more hops, more latency, and more points of failure.

This article breaks down each layer in plain language with enough technical depth that you can explain the difference between platforms to a skeptical client or evaluate vendors without relying on their marketing pages.

Layer 1: Telephony (How Calls Connect)

Telephony is the entry point. When a caller dials a business number, the call travels through the phone network (the Public Switched Telephone Network, or PSTN) and lands on the voice AI platform via a SIP trunk, which is essentially a digital phone line that connects traditional phone calls to internet-based systems.

For agency deployments, the important detail is that the client's business number does not change. The client sets up conditional call forwarding, the same mechanism their phone already uses for voicemail, so unanswered calls route to the AI instead of a voicemail box. The AI platform provides a destination phone number. The caller never knows anything changed.

Technical aside for the curious: SIP (Session Initiation Protocol) handles the signaling, setting up and tearing down the call. The actual audio travels over RTP (Real-time Transport Protocol). The quality of this layer depends on the platform's telephony provider, codec selection (the algorithm that compresses and decompresses the audio), and how close their servers are to the caller. A platform using a single, well-configured telephony provider will typically have lower jitter and packet loss (jitter is variation in audio timing; packet loss is audio data that never arrives) than one routing through multiple intermediaries.
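To make jitter and packet loss concrete, here is a minimal sketch of how they can be measured from packet timestamps. The jitter function follows the running estimator described in RFC 3550 (the RTP specification), simplified to use milliseconds instead of RTP timestamp units. The function names and the sample timings are illustrative assumptions, not any platform's actual monitoring code.

```python
def interarrival_jitter(send_times_ms, recv_times_ms):
    """Running jitter estimate in the style of RFC 3550, section 6.4.1.

    Simplification (assumption): times are in milliseconds, whereas real
    RTP implementations work in RTP timestamp units.
    """
    jitter = 0.0
    for i in range(1, len(send_times_ms)):
        transit_prev = recv_times_ms[i - 1] - send_times_ms[i - 1]
        transit_cur = recv_times_ms[i] - send_times_ms[i]
        # How much the one-way transit time changed between two packets.
        delta = abs(transit_cur - transit_prev)
        # Smooth the estimate with a gain of 1/16, as the RFC specifies.
        jitter += (delta - jitter) / 16.0
    return jitter


def packet_loss_ratio(expected_count, received_count):
    """Fraction of expected packets that never arrived."""
    if expected_count <= 0:
        return 0.0
    return (expected_count - received_count) / expected_count


# Hypothetical timings: packets sent every 20 ms, the third arrives 10 ms late.
sent = [0, 20, 40, 60]
received = [50, 70, 100, 110]
print(interarrival_jitter(sent, received))
print(packet_loss_ratio(expected_count=100, received_count=97))
```

A perfectly steady stream yields a jitter of zero; each intermediary that buffers or re-routes audio is another place where the transit time can vary.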

What to look for: Ask whether the platform owns its telephony relationship or routes through a third party. Every intermediary between the caller and the AI adds 50-150 milliseconds of latency and another potential failure point. The ITU-T G.114 recommendation, the long-standing international standard for one-way transmission time, sets 150 milliseconds as the threshold below which one-way delay is acceptable for most conversational applications; beyond it, callers begin to perceive degradation in conversational quality.

Layer 2: STT (How the AI Hears)

Speech-to-text, also called automatic speech recognition (ASR), converts the caller's spoken words into text that the AI can process. This happens in real time, streaming the audio as the caller speaks rather than waiting for them to finish.

Modern STT engines handle accents, background noise, and domain-specific vocabulary (think "HVAC compressor" or "crown replacement" in dental contexts) with 85-95% accuracy for routine calls. Accuracy differs significantly across providers and platforms.

Technical aside: STT models run on GPU-accelerated servers (specialized processors designed for the parallel math that speech recognition requires). The two primary approaches are streaming recognition (processing audio in chunks as it arrives, typically in 100-300 millisecond windows) and batch recognition (waiting for a full utterance before processing). Streaming is what voice AI uses because waiting for the caller to finish a full sentence before starting to process it would add seconds of dead air.
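The difference between streaming and batch recognition is easiest to see as chunking. The sketch below splits raw audio into fixed windows so each one could be handed to a recognizer as it arrives. It assumes 8 kHz, 16-bit mono PCM (8 kHz is the standard narrowband telephony sample rate; the 16-bit decoded format and the 200 ms window are assumptions), and the function name is hypothetical.

```python
def chunk_audio(pcm, sample_rate_hz=8000, bytes_per_sample=2, window_ms=200):
    """Yield successive fixed-duration windows of raw PCM audio.

    Assumed format: 8 kHz, 16-bit mono. The final chunk may be shorter
    than the others if the audio does not divide evenly.
    """
    chunk_bytes = sample_rate_hz * bytes_per_sample * window_ms // 1000
    for start in range(0, len(pcm), chunk_bytes):
        yield pcm[start:start + chunk_bytes]


# One second of silence stands in for a caller's utterance.
one_second = bytes(8000 * 2)
chunks = list(chunk_audio(one_second))
print(len(chunks), "chunks of", len(chunks[0]), "bytes")
```

With these settings a streaming recognizer can start work after the first 200 ms window, while a batch recognizer would wait for the full utterance before processing anything.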

Why this matters for agencies: STT quality directly affects whether the AI understands a caller's intent correctly. If the STT layer mishears "I need an emergency plumber" as "I need an emergency number," every layer downstream gives the wrong answer. When evaluating platforms, ask about noise handling and accent support for the verticals you plan to serve.

Layer 3: NLU/LLM (How the AI Understands and Responds)

This is the brain of the stack. Natural Language Understanding (NLU) and Large Language Models (LLMs) take the transcribed text from the STT layer and determine two things: what the caller wants (intent recognition) and what the AI should say or do next (response generation).

Older voice AI systems used rigid decision trees: if the caller says X, respond with Y. Modern platforms use LLMs (the same underlying technology as ChatGPT) to understand freeform conversation. A caller can say "my kitchen faucet has been dripping for a week and now there is water on the floor," and the LLM recognizes this as a plumbing emergency requiring a same-day appointment, not a routine maintenance inquiry. It can ask clarifying questions, handle topic changes mid-call, and access the business's knowledge base to give accurate answers about pricing, hours, and services.

Technical aside: The LLM inference step is often the largest source of latency in the stack. A typical LLM call takes 200-800 milliseconds depending on the model, prompt complexity, and server load. Platforms optimize this through prompt engineering (crafting shorter, more targeted instructions for the AI), model selection (smaller, faster models for simple tasks), and inference caching (storing and reusing responses for common queries so the AI does not regenerate them from scratch). Some platforms use a technique called speculative generation, starting to produce audio output before the full response is generated, to reduce perceived latency.
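Inference caching can be sketched in a few lines. The example below normalizes a caller's question (lowercase, punctuation removed) and reuses a stored answer when the same question comes up again. The class name, the normalization rule, and the stand-in `fake_llm` function are all illustrative assumptions; production systems generally need smarter matching and cache expiry.

```python
import re


class ResponseCache:
    """Reuse generated answers for repeated questions (illustrative only)."""

    def __init__(self, generate):
        self._generate = generate  # callable that performs the slow LLM call
        self._store = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(query):
        # Lowercase, drop punctuation, collapse whitespace.
        cleaned = re.sub(r"[^a-z0-9\s]", "", query.lower())
        return " ".join(cleaned.split())

    def respond(self, query):
        key = self._key(query)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        answer = self._generate(query)
        self._store[key] = answer
        return answer


def fake_llm(query):
    # Stand-in for a real model call, which would take hundreds of milliseconds.
    return "We are open 8am to 6pm, Monday to Friday."


cache = ResponseCache(fake_llm)
cache.respond("What are your hours?")
cache.respond("what are your hours")  # served from the cache
print(cache.hits, cache.misses)
```

The trade-off is freshness: a cached answer about hours or pricing must be invalidated when the business updates its information.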

What to look for: A platform that lets you use your client's website content and reviews as the AI's knowledge base, rather than requiring you to write hundreds of scripted responses manually. Website scraping cuts onboarding time by 60-70% compared to manual FAQ entry.

Layer 4: TTS (How the AI Speaks)

Text-to-speech converts the LLM's text response back into spoken audio that the caller hears. Modern TTS engines produce voices that sound natural, with appropriate pacing, intonation, and breathing patterns. The robotic monotone of older TTS systems (think early GPS navigation voices) is largely a solved problem at the platform level, though quality still varies between providers.

TTS adds its own latency. The platform needs to generate the audio, stream it back through the telephony layer, and play it to the caller. The best TTS engines start streaming audio within 100-200 milliseconds of receiving text, which keeps TTS a small share of the total time between the caller finishing their sentence and hearing the AI respond.

Technical aside: TTS models come in two main categories. Concatenative synthesis stitches together pre-recorded speech fragments. Neural synthesis (used by modern platforms) generates audio from scratch using neural networks, producing more natural prosody and inflection. Some platforms offer voice cloning, training a TTS model on a specific voice sample so the AI sounds like a particular person. Production-grade voice cloning requires careful tuning to avoid uncanny valley effects.

Honest caveat: TTS quality has improved dramatically, but it is not perfect. Uncommon proper nouns, long numerical sequences, and code-switching between languages mid-sentence can still produce awkward output. Agencies should test their clients' specific use cases, including the names of local streets, competitor businesses, and industry terminology, before going live.

Layer 5: Orchestration (How Calls Flow)

Orchestration is the coordination layer that ties everything together. It determines what happens after the AI understands the caller's intent: book an appointment, transfer the call to a human, take a message, send an SMS confirmation, log the interaction to a CRM, or some combination of these actions.

This layer manages the call state machine. It tracks where the conversation is, what the AI has already asked, what information it still needs, and what actions to trigger at each decision point. For example: if a caller asks to book an appointment, orchestration checks the business's calendar in real time, offers available slots, confirms the booking, sends an SMS to the caller, and emails a summary to the business owner, all within the same call.
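A call state machine for the booking example might look like the sketch below. It tracks what has been collected, what is still missing, and which actions fire at each transition. The state names, required fields, and action labels are hypothetical, chosen only to mirror the booking flow described above; a real orchestration layer would call calendar, SMS, and email services instead of appending strings to a list.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class CallState(Enum):
    COLLECTING = auto()      # still gathering details from the caller
    OFFERING_SLOTS = auto()  # calendar checked, offering available times
    CONFIRMED = auto()       # booking made, follow-ups triggered


# Hypothetical details the AI must gather before checking the calendar.
REQUIRED_FIELDS = ("name", "service", "preferred_day")


@dataclass
class BookingCall:
    state: CallState = CallState.COLLECTING
    collected: dict = field(default_factory=dict)
    actions: list = field(default_factory=list)

    def missing_fields(self):
        return [f for f in REQUIRED_FIELDS if f not in self.collected]

    def provide(self, field_name, value):
        self.collected[field_name] = value
        if self.state is CallState.COLLECTING and not self.missing_fields():
            self.state = CallState.OFFERING_SLOTS
            self.actions.append("check_calendar")

    def choose_slot(self, slot):
        if self.state is not CallState.OFFERING_SLOTS:
            raise ValueError("cannot book before required details are collected")
        self.collected["slot"] = slot
        self.state = CallState.CONFIRMED
        self.actions += ["confirm_booking", "send_sms", "email_summary"]


call = BookingCall()
call.provide("name", "Ana")
call.provide("service", "faucet repair")
call.provide("preferred_day", "Tuesday")
call.choose_slot("Tuesday 10:00")
print(call.state.name, call.actions)
```

The guard in `choose_slot` is the kind of rule worth testing before launch: it stops a booking from being made while required details are still missing.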

What to look for: The orchestration layer is where platform architectures diverge most. Some platforms use static flow builders where you draw conversation paths as flowcharts. Others use dynamic conversation architecture where the AI determines the path in real time based on the caller's responses. Dynamic systems handle unexpected questions and topic changes more naturally. Static flows can break when a caller goes off-script.

Trillet uses dynamic conversation architecture with a feature called Crews for multi-agent orchestration, allowing context-isolated handoffs between specialized agents within a single call. Plans start at $99/month (Studio) or $299/month (Agency) with $0.12/minute usage.

Honest caveat: Dynamic conversation architecture is more flexible than static flowcharts, but it is not a free pass. Because the AI determines the path in real time, edge cases are harder to predict than with a rigid flow you can map end to end. Trillet gives you the multi-agent structure, but agencies still need to test their clients' specific scenarios and define clear guardrails, especially for high-stakes actions like booking, payment collection, or transfers. The flexibility pays off, but only if you invest in the upfront testing that any production voice deployment requires.

If you want to evaluate a native, agency-ready stack for yourself, explore the Trillet white-label platform and the complete white-label voice AI guide, which walks through how the full five-layer stack maps to agency margins and reseller economics.

Why Latency Compounds Across Layers

Each of these five layers adds processing time. In a well-tuned stack, the total comes to roughly 1.5 to 2.5 seconds end to end: the time between the caller finishing a sentence and hearing the AI respond. That window is short enough to read as a natural pause rather than dead air.

Here is a rough breakdown of where the time goes:

Layer	Typical Latency
Telephony (SIP/RTP)	50-150 ms
STT (speech to text)	100-300 ms
LLM (understanding + response)	200-800 ms
TTS (text to speech)	100-200 ms
Network round-trips	50-100 ms
Total	500-1,550 ms

Add telephony overhead on both ends and you land at approximately 1.5-2.5 seconds total. The problem arises when a platform adds extra layers. A wrapper platform built on top of Vapi, for example, routes the call through its own layer, then Vapi's layer, then the underlying LLM and TTS providers. Each additional hop adds latency and a potential failure point. As of June 2026, voice AI latency benchmarks show that native platforms consistently deliver faster end-to-end response times than wrapper alternatives.
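The arithmetic behind the table, and the effect of extra hops, can be checked with a short script. The per-layer ranges come from the table above, and the 50-150 millisecond cost per hop is the figure this article uses for each intermediary. The choice of two extra hops for a wrapper is a hypothetical illustration, not a measurement of any specific platform.

```python
# Per-layer latency ranges in milliseconds (low, high), from the table above.
LAYER_LATENCY_MS = {
    "telephony": (50, 150),
    "stt": (100, 300),
    "llm": (200, 800),
    "tts": (100, 200),
    "network": (50, 100),
}

# Cost of each additional intermediary, per this article's estimate.
HOP_LATENCY_MS = (50, 150)


def total_latency_ms(extra_hops=0):
    """Return the (low, high) total for the stack plus any extra hops."""
    low = sum(lo for lo, _ in LAYER_LATENCY_MS.values())
    high = sum(hi for _, hi in LAYER_LATENCY_MS.values())
    return (low + extra_hops * HOP_LATENCY_MS[0],
            high + extra_hops * HOP_LATENCY_MS[1])


print("native:", total_latency_ms())                        # (500, 1550)
print("wrapper, 2 extra hops:", total_latency_ms(extra_hops=2))  # (600, 1850)
```

Under these assumptions, two extra hops add 100-300 milliseconds before any telephony overhead is counted.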

Why Agencies Should Care About the Stack

Understanding the technology stack is not about becoming an engineer. It is about being able to evaluate platforms, explain the technology to clients in a sales conversation, and diagnose problems when something goes wrong.

Three practical reasons this matters:

Latency is a client retention issue. If the AI takes 4-5 seconds to respond, callers hang up. The natural human conversational gap sits around 200-300 milliseconds, and a widely observed pattern in voice AI deployments is that response delays past one to two seconds make the interaction stop feeling like a conversation and push abandonment rates sharply higher. Your clients will blame you, not the platform. Knowing that latency compounds across layers helps you choose a platform that keeps delays as short as possible.

Wrapper architectures create support dead zones. When a wrapper platform has an issue, the wrapper vendor says "it is Vapi's problem," Vapi says "it is the LLM provider's problem," and nobody fixes it. Understanding the stack lets you identify where the actual failure point is and whether your platform vendor has the ability to resolve it.

Your pricing depends on it. Per-minute costs are driven by the underlying infrastructure. If a platform charges $0.12/minute, that cost is split across telephony, STT, LLM inference, and TTS usage. Platforms with more layers have higher cost floors. Understanding where the costs come from helps you set pricing that protects your margins, a core part of building a sustainable AI chatbot agency business model.
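A quick margin calculation shows how per-minute costs flow into agency pricing. The $299/month Agency plan and the $0.12/minute usage rate are the figures quoted earlier in this article; the client count, minutes per client, and resale price are hypothetical inputs to replace with your own numbers.

```python
def monthly_margin(clients, minutes_per_client, resale_price_per_client,
                   platform_fee=299.0, usage_rate=0.12):
    """Estimate monthly revenue, cost, profit, and margin for an agency.

    platform_fee and usage_rate default to the figures quoted in this
    article; every other input is an assumption supplied by the caller.
    """
    revenue = clients * resale_price_per_client
    usage_cost = clients * minutes_per_client * usage_rate
    cost = platform_fee + usage_cost
    profit = revenue - cost
    margin = profit / revenue if revenue else 0.0
    return {"revenue": revenue, "cost": cost, "profit": profit, "margin": margin}


# Hypothetical agency: 10 clients, 500 minutes each, resold at $400/month.
result = monthly_margin(clients=10, minutes_per_client=500,
                        resale_price_per_client=400.0)
print(result)
```

In this hypothetical, usage is about $600 of the roughly $899 monthly cost, so a higher per-minute floor on a more layered platform would eat into margin faster than a higher subscription fee.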

What to do: Before committing to a platform, ask three questions. Does the platform own its entire stack or depend on third-party providers? What is the measured end-to-end latency in production (not a demo)? And what happens to your clients if the platform's upstream provider goes down?

Frequently Asked Questions

What is the voice AI technology stack?

The voice AI technology stack is the set of five interconnected layers that make AI phone calls work: telephony (call connection), STT (converting speech to text), NLU/LLM (understanding intent and generating responses), TTS (converting text back to speech), and orchestration (managing call flow, bookings, transfers, and follow-ups). Each layer processes in sequence during every call, and the total latency across all layers determines how fast the AI responds.

What is the difference between STT and TTS in voice AI?

STT (speech-to-text) converts the caller's spoken words into text so the AI can process them. TTS (text-to-speech) does the reverse, converting the AI's text response back into natural-sounding audio. STT happens when the caller speaks; TTS happens when the AI speaks. Both add latency to the call: typically 100-300 milliseconds for STT and 100-200 milliseconds for TTS.

Why are wrapper platforms slower than native platforms?

Wrapper platforms add an extra layer between the agency and the underlying voice AI infrastructure. A call on a wrapper goes through the wrapper's servers, then to the provider (like Vapi or Retell), then to the LLM and TTS providers. Each additional hop adds 50-150 milliseconds of latency and creates another potential failure point. Native platforms process all five layers on their own infrastructure, removing intermediary hops.

Do agencies need to understand the tech stack to sell voice AI?

Not at an engineering level, but enough to answer three client questions: why does the AI respond quickly (latency across layers), why is it reliable (fewer dependencies means fewer failures), and how does it know my business (knowledge base training via website scraping). Understanding the stack at this level helps you differentiate between platforms during evaluation and handle objections during sales conversations.

What is orchestration in voice AI?

Orchestration is the coordination layer that determines what happens during and after a call. It manages appointment booking, call transfers, message taking, SMS confirmations, CRM logging, and follow-up workflows. The orchestration layer is where platforms differ most: some use static flowcharts, others use dynamic conversation architecture that adapts to the caller in real time.

Updated for June 2026: corrected technical citations (ITU-T G.114 for the 150 ms one-way delay threshold), reframed latency-and-abandonment evidence as a general industry observation, and added an honest caveat on dynamic conversation architecture.
