Voice AI Quality Assurance Playbook for Agencies

Voice AI quality assurance is the process of systematically reviewing AI call transcripts, scoring agent performance against defined criteria, and fixing issues before clients notice them. In our experience, agencies that run structured QA tend to keep clients on retainer far longer than those that deploy agents and walk away. The difference between a churned client and a 12-month retainer is almost always whether someone caught a wrong answer in week two, not whether the AI sounded good on the demo call.

This is not unique to agencies. An MIT study reported by Fortune in August 2025 found that roughly 95% of enterprise generative AI pilots fail to deliver measurable returns, and the researchers traced the failure not to the underlying models but to a "learning gap" in implementation: tools that are never adapted to the specific workflow they are supposed to serve (Fortune, 2025). Structured QA is how an agency closes that gap, turning a generic agent into one that actually fits the client's business.

This playbook covers the full QA framework: what to check, how often, how to score each call, and how to scale review across 10, 20, and 50 clients without hiring a team. It also includes a 5-point quality scorecard you can use starting today, a breakdown of the most common quality issues and their fixes, and strategies for turning QA data into agent improvements over time.

The QA Framework: What to Check, How Often, and What to Fix

Agency QA breaks into three cadences: daily review during the critical first 14 days, weekly sampling once the agent stabilizes, and monthly full audits across all clients. Each cadence catches a different category of problem, and skipping any one of them creates blind spots that eventually surface as client complaints.

In our experience, most agent errors surface in the first 14 days after deployment. The knowledge base is fresh, the conversation flows are untested with real callers, and edge cases surface that no amount of pre-launch testing predicts. After 14 days, patterns stabilize and you shift to sampling. Monthly audits catch slow drift: outdated pricing, seasonal changes the client forgot to mention, or gradual shifts in caller intent that the original configuration did not anticipate.

Daily Review (First 14 Days per Client)

Every transcript gets read during the first two weeks. Not sampled. Every single one. This is where you catch the problems that, left unfixed, become the reason a client cancels in month two.

What to look for each day:

Factual errors: Wrong pricing, incorrect hours, outdated service descriptions, or inaccurate appointment availability. These are the fastest path to client complaints because the caller acts on bad information.
Missed booking opportunities: The caller expressed interest in scheduling but the agent did not offer to book. This is lost revenue the client can quantify.
Escalation failures: Emergency or urgent calls that should have been routed to a human but were not. A plumber's burst pipe call that gets "I'll pass along the message" instead of an immediate notification is a client-ending event.
Tone and phrasing issues: Responses that sound robotic, overly formal, or mismatched to the client's brand voice. A casual surf shop should not get responses that read like a law firm's intake script.
Caller drop-offs: Calls where the caller hung up mid-conversation. Review the transcript to identify what caused the abandonment, whether it was a long pause, a confusing question, or an incorrect response.

Daily review process (15-20 minutes per client):

Pull all transcripts from the previous 24 hours
Read each transcript start to finish
Flag any issues using the 5-point scorecard (covered below)
Fix critical issues immediately (wrong pricing, broken booking flow)
Log non-critical issues for the weekly knowledge base update
Send the client a brief update if you fixed anything: "We caught a pricing discrepancy on your after-hours rate and corrected it this morning"

Weekly Review (After 14 Days)

Once an agent has been live for two weeks with issues resolved, you shift to sampling 20-30% of transcripts. The goal changes from catching deployment errors to monitoring for drift and identifying improvement opportunities.

What changes at the weekly cadence:

Sample selection matters. Do not just grab the first 20% chronologically. Sample across different days of the week and different times of day. After-hours calls, Monday morning spikes, and weekend calls each produce different caller behaviors.
Focus on patterns, not individual calls. A single awkward phrasing is not a crisis. The same awkward phrasing appearing in 8 out of 40 calls is a conversation flow problem worth fixing.
Track metrics over time. Booking conversion rate, average call duration, and escalation frequency should be trending in the right direction. If booking rate drops from 35% to 20% over three weeks, something changed.

Weekly review process (20-30 minutes per client):

Pull the full transcript list from the past 7 days
Select 20-30% of transcripts using stratified sampling (mix of days, times, call types)
Score each sampled transcript on the 5-point scorecard
Compare scores to previous weeks to identify trends
Compile a list of knowledge base updates needed
Push updates to the agent's knowledge base and conversation flows
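
The stratified sampling step can be sketched in a few lines of Python. This is a minimal illustration, not a feature of any particular platform: the Transcript fields, the 9-to-5 business-hours cutoff, and the stratified_sample helper are all assumptions made for the example.

```python
import math
import random
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Transcript:
    call_id: str
    started_at: datetime  # hypothetical field: when the call began


def stratum(t: Transcript) -> tuple[str, str]:
    """Bucket a call by day of week and business hours vs after hours."""
    # Assumed cutoff: 9:00-17:00 counts as business hours.
    period = "business" if 9 <= t.started_at.hour < 17 else "after_hours"
    return (t.started_at.strftime("%A"), period)


def stratified_sample(transcripts, rate=0.25, seed=None):
    """Sample roughly `rate` of the calls in every stratum, at least one each."""
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for t in transcripts:
        buckets[stratum(t)].append(t)
    sample = []
    for calls in buckets.values():
        k = max(1, math.ceil(len(calls) * rate))
        sample.extend(rng.sample(calls, k))
    return sample
```

Because every day-and-period bucket contributes at least one call, low-volume weeks will come out somewhat above the nominal rate. That errs on the side of reading more, which is usually the right trade for QA.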

Monthly Full Audit (All Clients)

Once per month, run a full audit across every client's agent. This is not a transcript-by-transcript review. It is a strategic assessment of whether each agent is still aligned with the client's current business.

Monthly audit checklist:

Verify all pricing in the knowledge base matches the client's current pricing
Confirm business hours, holiday schedules, and seasonal changes are accurate
Review the top 10 most common caller questions and verify the agent's answers
Check booking conversion rates and compare to the client's expectations
Review any client feedback or complaints received during the month
Test the agent yourself with 2-3 calls simulating common scenarios
Update the client's monthly report with QA findings and improvements made. To make the metrics you track actionable, build automated alert thresholds into your monthly reporting.
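
An automated threshold can be as simple as comparing the latest weekly booking rate to the start of the window. The sketch below is one possible shape, assuming you already export weekly booking and call counts; the function names and the 10-percentage-point default are illustrative choices, not recommended values.

```python
def booking_rate(bookings: int, calls: int) -> float:
    """Share of calls that ended in a booking (0.0 when there were no calls)."""
    return bookings / calls if calls else 0.0


def rate_drop_alert(weekly_rates, max_drop=0.10):
    """True if the latest rate is more than `max_drop` below the first week in the window.

    `weekly_rates` is ordered oldest to newest, e.g. the last three weeks.
    `max_drop` is an assumed threshold; tune it per client.
    """
    if len(weekly_rates) < 2:
        return False
    return weekly_rates[0] - weekly_rates[-1] > max_drop
```

With these defaults, a slide from 35% to 20% over three weeks raises the alert, while a drift from 35% to 33% does not.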

The 5-Point Quality Scorecard

Score every reviewed call on five dimensions, each rated 1 (fail), 2 (needs improvement), or 3 (pass), for a total between 5 and 15. A call scoring 12 or below needs investigation. An agent consistently scoring below 10 across multiple calls needs immediate intervention.

1. Greeting Accuracy

The agent should identify the business by name, match the expected tone, and orient the caller within the first 10 seconds. A dental office agent that opens with "Hi, how can I help you today?" without naming the practice fails this criterion. The caller needs confirmation they reached the right place.

Pass (3): Names the business, uses the correct greeting script, matches the brand's tone
Needs improvement (2): Names the business but uses a generic or mismatched tone
Fail (1): Does not name the business, uses the wrong business name, or opens with a confusing prompt

2. Information Correctness

Every fact the agent states must be verifiable against the client's current knowledge base. This includes pricing, service descriptions, availability, location details, and policy information. A partially correct answer on anything the caller might act on counts as a fail because callers cannot distinguish between "mostly right" and "completely right."

Pass (3): All stated facts match the current knowledge base
Needs improvement (2): Minor inaccuracies that did not affect the outcome (e.g., rounding a price to the nearest $10)
Fail (1): Stated incorrect pricing, hours, services, or policies that could mislead the caller

3. Appointment Handling

If the caller expressed any interest in booking, the agent should have attempted to schedule. "I'll have someone call you back" when the calendar integration is active and the caller is ready to book is a missed conversion.

Pass (3): Offered booking when appropriate, confirmed details, provided confirmation
Needs improvement (2): Offered booking but missed a detail (no confirmation, wrong appointment type)
Fail (1): Did not offer booking when the caller was clearly interested, or booked with incorrect details

4. Escalation Handling

Certain calls must reach a human: emergencies, complex complaints, high-value opportunities the client wants to handle personally, or situations where the caller explicitly asks to speak with someone. The agent should recognize these triggers and route accordingly.

Pass (3): Correctly identified escalation triggers and routed to the appropriate person or sent an urgent notification
Needs improvement (2): Identified the need for escalation but used a slow channel (email instead of SMS for an emergency)
Fail (1): Missed an obvious escalation trigger, or told the caller "someone will get back to you" for an emergency

5. Tone and Professionalism

The agent should sound natural, match the client's brand personality, and handle unexpected questions gracefully. Robotic repetition of scripts, abrupt topic changes, and failure to acknowledge caller frustration all fall here.

Pass (3): Natural conversation flow, appropriate empathy, brand-aligned language
Needs improvement (2): Mostly natural but with one or two stilted responses or missed emotional cues
Fail (1): Robotic or inappropriate responses, failure to acknowledge caller concerns, or brand-mismatched tone

Using the Scorecard

Dimension	Weight	Score Range
Greeting accuracy	Equal	1-3
Information correctness	Equal	1-3
Appointment handling	Equal	1-3
Escalation handling	Equal	1-3
Tone and professionalism	Equal	1-3
Total		5-15

Scoring thresholds:

13-15: Agent is performing well. Log the scores and move on.
10-12: Agent needs targeted fixes. Identify which dimensions are dragging the score down and prioritize those.
Below 10: Immediate intervention required. Review the knowledge base, conversation flows, and escalation rules before the next business day.
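
If you log scores in a spreadsheet or script, the thresholds translate directly into code. The following is a minimal sketch; the CallScore class, its field names, and the band labels are names invented for this example.

```python
from dataclasses import asdict, astuple, dataclass


@dataclass(frozen=True)
class CallScore:
    """One reviewed call, each dimension rated 1 (fail) to 3 (pass)."""
    greeting: int
    information: int
    appointment: int
    escalation: int
    tone: int

    def total(self) -> int:
        scores = astuple(self)
        if any(s not in (1, 2, 3) for s in scores):
            raise ValueError("each dimension must be scored 1, 2, or 3")
        return sum(scores)


def classify(score: CallScore) -> str:
    """Map a call's total to the three threshold bands."""
    total = score.total()
    if total >= 13:
        return "performing well"
    if total >= 10:
        return "targeted fixes"
    return "immediate intervention"


def weakest_dimensions(score: CallScore) -> list[str]:
    """Names of the dimensions with the lowest rating on this call."""
    ratings = asdict(score)
    low = min(ratings.values())
    return [name for name, value in ratings.items() if value == low]
```

For a call in the 10-12 band, weakest_dimensions points at where to look first, which is the "identify which dimensions are dragging the score down" step.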

Common Quality Issues and How to Fix Them

In our experience, four categories of quality issues account for roughly 80% of the agent errors agencies encounter. Each one has a specific root cause and a specific fix, not a vague "retrain the agent" response.

Wrong Pricing

The agent quotes a price that does not match the client's current rates. This happens most often when the client changes pricing after initial deployment and forgets to notify the agency, or when the website scrape captured promotional pricing that has since expired.

How to fix this: Update the knowledge base with current pricing immediately. Set a calendar reminder to verify pricing with the client monthly. For clients with seasonal or promotional pricing, build a pricing update schedule into the onboarding agreement. Some agencies include a clause requiring clients to notify them of pricing changes within 48 hours.

Outdated Hours or Availability

The agent states business hours that are no longer accurate, particularly around holidays, seasonal schedules, or recently changed operating hours.

How to fix this: Cross-reference the agent's hours against the client's Google Business Profile (GBP) during monthly audits. GBP is usually the first place clients update their hours, so if the two differ, the knowledge base needs updating. To catch holiday changes proactively, build a pre-holiday check into your workflow two weeks before major holidays.

Missed Booking Opportunities

The caller said something like "Can I come in Thursday?" or "Do you have any openings this week?" and the agent responded with "I'll pass that along" instead of checking the calendar and booking the slot.

How to fix this: This is usually a conversation flow problem, not a knowledge base problem. Review the agent's booking triggers. The agent should recognize scheduling intent from phrases like "when can I," "do you have availability," "I'd like to schedule," and "can I book." If the calendar integration is active and the agent still deflects, the booking flow needs adjustment. If the integration is not active, connect it.

Failure to Escalate Emergencies

The most consequential quality failure. A plumber's emergency call gets the standard "we'll call you back during business hours" treatment. A medical office's urgent patient call gets routed to voicemail.

How to fix this: Define emergency keywords and scenarios for each vertical. Plumbing: "burst pipe," "flooding," "water everywhere," "gas leak." Medical: "chest pain," "can't breathe," "allergic reaction," "emergency." Legal: "arrested," "in custody," "court tomorrow." Program these as high-priority escalation triggers that send immediate SMS notifications to the client. Test these triggers monthly. A single missed emergency call can end a client relationship permanently.
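
As a rough illustration of how those keyword lists might be checked against a transcript, here is a minimal Python sketch. It uses plain substring matching on the caller's words. The find_emergency_triggers function and the dictionary layout are assumptions for the example; your voice platform will have its own way of configuring triggers.

```python
# Keyword lists taken from the examples above; extend them per client.
EMERGENCY_KEYWORDS = {
    "plumbing": ["burst pipe", "flooding", "water everywhere", "gas leak"],
    "medical": ["chest pain", "can't breathe", "allergic reaction", "emergency"],
    "legal": ["arrested", "in custody", "court tomorrow"],
}


def find_emergency_triggers(vertical: str, caller_text: str) -> list[str]:
    """Return the emergency phrases for this vertical found in the caller's words."""
    # Lowercase and normalize curly apostrophes so "can’t" matches "can't".
    text = caller_text.lower().replace("’", "'")
    return [kw for kw in EMERGENCY_KEYWORDS.get(vertical, []) if kw in text]
```

Substring matching will miss paraphrases ("my basement is under water"), so treat it as a safety net alongside the agent's own intent detection rather than a replacement. Running each configured phrase through the same check is a cheap part of the monthly trigger test, though it only verifies the matching logic, not that the SMS actually arrives.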

How to Use QA Data to Improve Agents Over Time

QA is not just about catching errors. The transcripts you review contain a map of every conversation your agents handle, and that map reveals exactly where to improve. Agencies that treat QA as a feedback loop, rather than a checkbox, build agents that get measurably better every month.

Knowledge Base Updates

Every QA session should produce a list of knowledge base additions. When callers ask questions the agent cannot answer, that question belongs in the knowledge base. Track the frequency of unanswered questions across all clients in the same vertical. If three different dental clients' agents all struggle with "do you accept my insurance," the answer template belongs in your dental onboarding playbook, not just in one client's knowledge base.

What to do: Maintain a running log of unanswered or poorly answered questions per client. Batch knowledge base updates weekly rather than after every call. Group similar questions and write one comprehensive answer rather than multiple narrow ones. After updating, test the agent with the exact phrasing callers used, not your own paraphrased version.
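
The running log lends itself to a simple cross-client rollup. The sketch below assumes each log entry is a (vertical, client ID, question topic) tuple that you tag by hand during review; the recurring_gaps name and the three-client cutoff are illustrative.

```python
from collections import defaultdict


def recurring_gaps(log, min_clients=3):
    """Find question topics that trip up several clients in the same vertical.

    `log` is an iterable of (vertical, client_id, topic) tuples.
    Returns {vertical: [topics seen at `min_clients` or more distinct clients]}.
    """
    clients_by_topic = defaultdict(set)
    for vertical, client_id, topic in log:
        clients_by_topic[(vertical, topic)].add(client_id)

    result = defaultdict(list)
    for (vertical, topic), clients in clients_by_topic.items():
        if len(clients) >= min_clients:
            result[vertical].append(topic)
    return dict(result)
```

Topics that show up here are candidates for the vertical's onboarding playbook. Everything else stays a per-client knowledge base fix.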

Conversation Flow Adjustments

Some issues are not about what the agent knows but about how it navigates the conversation. If callers consistently get stuck in loops, receive redundant questions, or experience abrupt topic changes, the conversation architecture needs adjustment.

Common flow problems and fixes:

The agent asks for information the caller already provided. This usually means the conversation memory is not carrying context between turns. Review the agent's configuration to ensure extracted data persists through the call.
The agent jumps to booking before qualifying. Some callers need qualification questions answered first (service area, insurance acceptance, appointment type). Adjust the flow to qualify before offering to schedule.
The agent does not know when to stop talking. If transcripts show the agent adding unnecessary information after the caller's question has been answered, tighten the response parameters.

Vertical Pattern Libraries

After managing 5-10 clients in the same vertical, you will have enough QA data to build a pattern library: the most common caller questions, the most effective agent responses, and the most frequent failure modes for that industry. This library becomes your competitive advantage. New clients in the same vertical deploy faster and perform better from day one because you have already solved their industry's common problems.

QA at Scale: 10 vs 20 vs 50 Clients

Reviewing every transcript for every client does not scale. The framework above handles individual clients, but agencies need a system for deciding where to spend QA time as the portfolio grows. The weekly workflow for managing voice agent clients covers operational cadence, and the broader white-label voice AI guide sets QA in the context of building and running a profitable agency. This section covers QA-specific scaling.

10 Clients: Manual Is Still Viable

At 10 clients with an average of 15-20 calls per client per week, you are reviewing roughly 30-60 sampled transcripts per week (20-30% of 150-200 total). That is 3-5 hours of QA work. Still manageable for a solo operator.
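
The same arithmetic works for any portfolio size, so it is worth putting in a small helper. The per-transcript review time is an assumption: about 5 minutes is what 60 transcripts in 5 hours implies, but you should measure your own pace.

```python
def weekly_qa_load(clients, calls_per_client, sample_pct, minutes_per_transcript=5):
    """Return (sampled transcripts, review hours) for one week.

    `minutes_per_transcript` is an assumed average, not a measured figure.
    """
    total_calls = clients * calls_per_client
    sampled = total_calls * sample_pct // 100  # integer math, rounds down
    hours = sampled * minutes_per_transcript / 60
    return sampled, hours


# Low and high ends of the 10-client scenario.
low = weekly_qa_load(clients=10, calls_per_client=15, sample_pct=20)
high = weekly_qa_load(clients=10, calls_per_client=20, sample_pct=30)
print(low, high)
```

The sampled counts reproduce the 30-60 range. The hours depend entirely on the per-transcript estimate you pass in.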

What to prioritize: Maintain the full framework. Daily review for any client in their first 14 days. Weekly sampling for established clients. Monthly audits for all. At this scale, you know each client's agent well enough to spot problems quickly.

20 Clients: Tiered Prioritization

At 20 clients, you are looking at 60-120 sampled transcripts per week. That is 6-10 hours of QA. Still possible solo, but you need to tier your clients.

Tiering strategy:

Tier 1 (high-touch): New clients (through day 30; every transcript is still read during the first 14 days), clients with recent complaints, and clients in high-stakes verticals (medical, legal). Review 30-40% of transcripts weekly.
Tier 2 (standard): Established clients performing well. Review 15-20% of transcripts weekly.
Tier 3 (low-touch): Clients with 3+ months of clean QA scores and no complaints. Review 10% of transcripts weekly, but never drop below 5 calls per week per client.
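
One way to make the tiers operational is to compute each client's tier and weekly sample size from a few fields you already track. This is a sketch under stated assumptions: the input fields are hypothetical, the percentages are single values picked from within each tier's range, and the 5-call floor is applied to every tier for simplicity even though the rule above states it for Tier 3.

```python
# Assumed sampling percentages, chosen from within each tier's range.
TIER_SAMPLE_PCT = {1: 35, 2: 18, 3: 10}
MIN_CALLS_PER_WEEK = 5
HIGH_STAKES_VERTICALS = {"medical", "legal"}


def assign_tier(days_live, recent_complaint, vertical, clean_months):
    """Tier 1 = high-touch, Tier 2 = standard, Tier 3 = low-touch."""
    if days_live <= 30 or recent_complaint or vertical in HIGH_STAKES_VERTICALS:
        return 1
    if clean_months >= 3:
        return 3
    return 2


def weekly_sample_size(tier, calls_this_week):
    """Transcripts to review this week, never below the floor or above the total."""
    pct = TIER_SAMPLE_PCT[tier]
    target = (calls_this_week * pct + 99) // 100  # integer ceiling
    return min(calls_this_week, max(MIN_CALLS_PER_WEEK, target))
```

Note that a high-stakes vertical keeps a client in Tier 1 no matter how clean its history is, which matches the intent of the tiering: the review rate follows the cost of a miss, not just the track record.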

50 Clients: Automated Flagging Required

At 50 clients, manual review of even 10% of transcripts means reading 75-100 transcripts per week. You need automated flagging to direct your attention.

Automated flagging criteria:

Calls where the caller hung up before the agent completed the conversation
Calls where the agent said "I don't have that information" or similar uncertainty phrases
Calls longer than 2x the average duration for that client (likely indicates the agent got stuck in a loop)
Calls where no booking was made despite the caller mentioning scheduling-related keywords
Calls with negative sentiment indicators
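
These criteria map onto a short rule-based filter. The sketch below is illustrative only: the CallRecord fields, the phrase lists, and the sentiment cutoff are assumptions, and the sentiment score is presumed to come from whatever analytics tool you already use.

```python
from dataclasses import dataclass

# Assumed phrase lists; extend them from your own transcripts.
UNCERTAINTY_PHRASES = ["i don't have that information", "i'm not sure"]
SCHEDULING_KEYWORDS = ["schedule", "book", "appointment", "availability", "openings"]


@dataclass
class CallRecord:
    duration_sec: float
    caller_hung_up_early: bool
    booking_made: bool
    agent_text: str   # everything the agent said
    caller_text: str  # everything the caller said
    sentiment: float  # hypothetical score in [-1, 1] from an external tool


def flag_call(call: CallRecord, avg_duration_sec: float) -> list[str]:
    """Return the reasons this call should go to human review (empty if none)."""
    flags = []
    if call.caller_hung_up_early:
        flags.append("caller_hung_up")
    agent = call.agent_text.lower()
    if any(p in agent for p in UNCERTAINTY_PHRASES):
        flags.append("agent_uncertain")
    if call.duration_sec > 2 * avg_duration_sec:
        flags.append("long_call")
    caller = call.caller_text.lower()
    if not call.booking_made and any(k in caller for k in SCHEDULING_KEYWORDS):
        flags.append("missed_booking")
    if call.sentiment < -0.3:  # assumed cutoff for "negative"
        flags.append("negative_sentiment")
    return flags
```

Substring checks like these over-flag ("book" also matches "Facebook"), which is acceptable here because a human reads every flagged call. Tighten the lists only once false positives start eating review time.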

At this scale, your QA process becomes:

Automated flags filter the full transcript volume down to 15-20% that need human review
You review flagged transcripts only (roughly 110-200 per week across 50 clients, assuming 15-20 calls per client per week)
Monthly audits shift to quarterly for Tier 3 clients with perfect automated scores
You hire or contract a QA specialist for transcript review, keeping strategic decisions (knowledge base architecture, conversation flow design) for yourself

Trillet's white-label voice AI platform provides call transcripts, summaries, and analytics dashboards that support this tiered QA workflow. As of June 2026, white-label plans start at $99/month, with the Agency plan at $299/month including unlimited sub-accounts and per-minute usage around $0.12. See the full white-label voice AI guide for how QA fits into agency operations and margins.

The Honest Caveat

No QA process catches everything. Voice AI agents will occasionally give a wrong answer, miss a booking cue, or handle an edge case poorly. The goal of QA is not perfection. It is reducing the error rate to a level where the client's callers receive consistently good service and the client sees measurable value from the agent every month. An agent that handles 85-90% of calls well and gracefully escalates the rest is outperforming the voicemail box it replaced. The agencies that struggle with QA are not the ones with imperfect agents. They are the ones with no system for finding and fixing problems before the client does.

Frequently Asked Questions

How long should QA take per client per week?

During the first 14 days, expect 15-20 minutes per day reviewing all transcripts for a client handling 10-15 calls daily. After the initial period, weekly sampling takes 20-30 minutes per client. Monthly audits add 30-45 minutes per client once per month. At 20 clients past the 14-day mark, total weekly QA time is 7-10 hours.

What is the minimum viable QA process for a solo agency?

At minimum, review every transcript for the first 14 days per client, then sample 20% weekly. Score each sampled call on the 5-point scorecard. Push knowledge base fixes within 24 hours of identifying them. Skip any of these three steps and you are guessing about agent quality instead of measuring it.

Should I share QA scores with clients?

Share the outcomes, not the raw scores. Clients want to know their agent handled 95% of calls correctly this month, that you fixed two pricing discrepancies, and that booking rates increased 12% after a conversation flow adjustment. They do not want a spreadsheet of 1-3 ratings for individual calls. The QA process is your internal tool. The results are your client deliverable.

How do I handle a client who reports an issue before I catch it in QA?

Acknowledge it immediately, fix it within hours (not days), and send a brief explanation of what happened and what you changed. Then review your QA process to understand why you missed it. If the issue was in a transcript you sampled but overlooked, you need sharper review criteria. If it was in an unsampled transcript, consider increasing your sampling rate for that client temporarily.

Can QA be fully automated?

Not yet. Automated tools can flag anomalies like dropped calls, long durations, and uncertainty phrases. But determining whether the agent gave a contextually correct answer, matched the client's brand tone, or missed a subtle booking cue still requires human judgment. The best approach as of June 2026 is automated flagging paired with human review of flagged calls.

Updated for June 2026: Refreshed the retention framing with a cited third-party data point (MIT/Fortune), confirmed current Trillet white-label pricing, and updated internal links.
