Voice AI Quality Assurance Playbook for Agencies

Ming Xu, Chief Information Officer

Voice AI quality assurance is the process of systematically reviewing AI call transcripts, scoring agent performance against defined criteria, and fixing issues before clients notice them. Agencies that run structured QA retain clients at 2x the rate of those that deploy agents and walk away. The difference between a churned client and a 12-month retainer is almost always whether someone caught a wrong answer in week two, not whether the AI sounded good on the demo call.

This playbook covers the full QA framework: what to check, how often, how to score each call, and how to scale review across 10, 20, and 50 clients without hiring a team. It also includes a 5-point quality scorecard you can use starting today, a breakdown of the most common quality issues and their fixes, and strategies for turning QA data into agent improvements over time.

The QA Framework: What to Check, How Often, and What to Fix

Agency QA breaks into three cadences: daily review during the critical first 14 days, weekly sampling once the agent stabilizes, and monthly full audits across all clients. Each cadence catches a different category of problem, and skipping any one of them creates blind spots that eventually surface as client complaints.

The first 14 days after deployment are when most agent errors occur. The knowledge base is fresh, the conversation flows are untested with real callers, and edge cases surface that no amount of pre-launch testing predicts. After 14 days, patterns stabilize and you shift to sampling. Monthly audits catch slow drift: outdated pricing, seasonal changes the client forgot to mention, or gradual shifts in caller intent that the original configuration did not anticipate.

Daily Review (First 14 Days per Client)

Every transcript gets read during the first two weeks. Not sampled. Every single one. This is where you catch the problems that, left unfixed, become the reason a client cancels in month two.

What to look for each day:

Daily review process (15-20 minutes per client):

  1. Pull all transcripts from the previous 24 hours

  2. Read each transcript start to finish

  3. Flag any issues using the 5-point scorecard (covered below)

  4. Fix critical issues immediately (wrong pricing, broken booking flow)

  5. Log non-critical issues for the weekly knowledge base update

  6. Send the client a brief update if you fixed anything: "We caught a pricing discrepancy on your after-hours rate and corrected it this morning"

Weekly Review (After 14 Days)

Once an agent has been live for two weeks with issues resolved, you shift to sampling 20-30% of transcripts. The goal changes from catching deployment errors to monitoring for drift and identifying improvement opportunities.

What changes at the weekly cadence:

Weekly review process (30-45 minutes per client):

  1. Pull the full transcript list from the past 7 days

  2. Select 20-30% of transcripts using stratified sampling (mix of days, times, call types)

  3. Score each sampled transcript on the 5-point scorecard

  4. Compare scores to previous weeks to identify trends

  5. Compile a list of knowledge base updates needed

  6. Push updates to the agent's knowledge base and conversation flows
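Step 2 of the weekly process can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed tool: the transcript fields (`id`, `weekday`, `call_type`) and the function name `stratified_sample` are assumptions made for the example, and the default rate of 25% is simply the midpoint of the 20-30% range.

```python
import random
from collections import defaultdict


def stratified_sample(transcripts, rate=0.25, seed=None):
    """Pick about `rate` of the transcripts, spread across (weekday, call_type) groups."""
    if not transcripts:
        return []
    rng = random.Random(seed)
    target = min(len(transcripts), max(1, round(len(transcripts) * rate)))

    # Group transcripts into strata so no day or call type dominates the sample.
    strata = defaultdict(list)
    for t in transcripts:
        strata[(t["weekday"], t["call_type"])].append(t)

    groups = list(strata.values())
    for group in groups:
        rng.shuffle(group)
    rng.shuffle(groups)

    # Round-robin: take one transcript per stratum until the target is reached.
    sample = []
    while len(sample) < target:
        for group in groups:
            if group and len(sample) < target:
                sample.append(group.pop())
    return sample


# Hypothetical week of calls: 5 days x 2 call types x 4 calls each.
days = ["Mon", "Tue", "Wed", "Thu", "Fri"]
types = ["booking", "inquiry"]
week = [
    {"id": f"{d}-{c}-{n}", "weekday": d, "call_type": c}
    for d in days for c in types for n in range(4)
]
picked = stratified_sample(week, rate=0.25, seed=7)
print(len(picked), "of", len(week), "transcripts selected")
```

The round-robin step is one design choice among several. It keeps the total at the target rate while still touching as many day and call-type combinations as possible; time of day could be added to the stratum key in the same way.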

Monthly Full Audit (All Clients)

Once per month, run a full audit across every client's agent. This is not a transcript-by-transcript review. It is a strategic assessment of whether each agent is still aligned with the client's current business.

Monthly audit checklist:

The 5-Point Quality Scorecard

Score every reviewed call on five dimensions, each rated 1 (fail), 2 (needs improvement), or 3 (pass). A call scoring below 12 out of 15 needs investigation. An agent consistently scoring below 12 across multiple calls needs immediate intervention.

1. Greeting Accuracy

The agent should identify the business by name, match the expected tone, and orient the caller within the first 10 seconds. A dental office agent that opens with "Hi, how can I help you today?" without naming the practice fails this criterion. The caller needs confirmation they reached the right place.

Pass (3): Names the business, uses the correct greeting script, matches the brand's tone

Needs improvement (2): Names the business but uses a generic or mismatched tone

Fail (1): Does not name the business, uses the wrong business name, or opens with a confusing prompt

2. Information Correctness

Every fact the agent states must be verifiable against the client's current knowledge base. This includes pricing, service descriptions, availability, location details, and policy information. An answer that gets a material fact wrong counts as a fail even when the rest is right, because callers cannot distinguish between "mostly right" and "completely right."

Pass (3): All stated facts match the current knowledge base

Needs improvement (2): Minor inaccuracies that did not affect the outcome (e.g., rounding a price to the nearest $10)

Fail (1): Stated incorrect pricing, hours, services, or policies that could mislead the caller

3. Appointment Handling

If the caller expressed any interest in booking, the agent should have attempted to schedule. "I'll have someone call you back" when the calendar integration is active and the caller is ready to book is a missed conversion.

Pass (3): Offered booking when appropriate, confirmed details, provided confirmation

Needs improvement (2): Offered booking but missed a detail (no confirmation, wrong appointment type)

Fail (1): Did not offer booking when the caller was clearly interested, or booked with incorrect details

4. Escalation Handling

Certain calls must reach a human: emergencies, complex complaints, high-value opportunities the client wants to handle personally, or situations where the caller explicitly asks to speak with someone. The agent should recognize these triggers and route accordingly.

Pass (3): Correctly identified escalation triggers and routed to the appropriate person or sent an urgent notification

Needs improvement (2): Identified the need for escalation but used a slow channel (email instead of SMS for an emergency)

Fail (1): Missed an obvious escalation trigger, or told the caller "someone will get back to you" for an emergency

5. Tone and Professionalism

The agent should sound natural, match the client's brand personality, and handle unexpected questions gracefully. Robotic repetition of scripts, abrupt topic changes, and failure to acknowledge caller frustration all fall here.

Pass (3): Natural conversation flow, appropriate empathy, brand-aligned language

Needs improvement (2): Mostly natural but with one or two stilted responses or missed emotional cues

Fail (1): Robotic or inappropriate responses, failure to acknowledge caller concerns, or brand-mismatched tone

Using the Scorecard

Dimension                   Weight   Score Range
Greeting accuracy           Equal    1-3
Information correctness     Equal    1-3
Appointment handling        Equal    1-3
Escalation handling         Equal    1-3
Tone and professionalism    Equal    1-3
Total                                5-15
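The scoring rule in this section (five equally weighted dimensions, each rated 1-3, with totals below 12 needing investigation) could be captured in a small data structure. The sketch below is illustrative only: the class name `CallScore`, the field names, and the reading of "consistently" as "at least three low-scoring calls" are assumptions, not part of the playbook.

```python
from dataclasses import dataclass

DIMENSIONS = ("greeting", "information", "appointment", "escalation", "tone")
INVESTIGATE_BELOW = 12  # from the text: a call below 12 out of 15 needs investigation


@dataclass(frozen=True)
class CallScore:
    call_id: str
    greeting: int
    information: int
    appointment: int
    escalation: int
    tone: int

    def __post_init__(self):
        # Each dimension is rated 1 (fail), 2 (needs improvement), or 3 (pass).
        for name in DIMENSIONS:
            if getattr(self, name) not in (1, 2, 3):
                raise ValueError(f"{name} must be 1, 2, or 3")

    @property
    def total(self) -> int:
        return sum(getattr(self, name) for name in DIMENSIONS)

    @property
    def needs_investigation(self) -> bool:
        return self.total < INVESTIGATE_BELOW


def needs_intervention(scores, min_low_calls=3):
    # Assumption: "consistently" is interpreted as min_low_calls or more
    # reviewed calls falling below the threshold. Tune this to your own policy.
    low = sum(1 for s in scores if s.needs_investigation)
    return low >= min_low_calls


call = CallScore("call-001", greeting=3, information=1, appointment=2, escalation=3, tone=2)
print(call.total, call.needs_investigation)
```

Validating the 1-3 range at construction time keeps a mistyped score from silently inflating or deflating a weekly trend.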

Scoring thresholds:

Common Quality Issues and How to Fix Them

Four categories of quality issues account for roughly 80% of all agent errors agencies encounter. Each one has a specific root cause and a specific fix, not a vague "retrain the agent" response.

Wrong Pricing

The agent quotes a price that does not match the client's current rates. This happens most often when the client changes pricing after initial deployment and forgets to notify the agency, or when the website scrape captured promotional pricing that has since expired.

How to fix this: Update the knowledge base with current pricing immediately. Set a calendar reminder to verify pricing with the client monthly. For clients with seasonal or promotional pricing, build a pricing update schedule into the onboarding agreement. Some agencies include a clause requiring clients to notify them of pricing changes within 48 hours.

Outdated Hours or Availability

The agent states business hours that are no longer accurate, particularly around holidays, seasonal schedules, or recently changed operating hours.

How to fix this: Cross-reference the agent's hours against the client's Google Business Profile (GBP) during monthly audits. GBP is usually the first place clients update their hours, so if the two differ, the knowledge base needs updating. To catch these errors proactively, build a pre-holiday check into your workflow two weeks before each major holiday.
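The cross-reference itself is a simple comparison once both sets of hours are written down. The sketch below assumes the hours have been copied by hand into two dictionaries; it does not call any Google API, and the function name `hours_mismatches`, the `"closed"` default, and the sample hours are all hypothetical.

```python
WEEKDAYS = ("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")


def hours_mismatches(agent_hours, profile_hours):
    """Return {day: (agent_value, profile_value)} for every day that differs.

    A day missing from either dictionary is treated as "closed".
    """
    mismatches = {}
    for day in WEEKDAYS:
        agent = agent_hours.get(day, "closed")
        profile = profile_hours.get(day, "closed")
        if agent != profile:
            mismatches[day] = (agent, profile)
    return mismatches


# Hypothetical data: what the agent's knowledge base says vs. the public profile.
agent_hours = {
    "Mon": "08:00-17:00", "Tue": "08:00-17:00", "Wed": "08:00-17:00",
    "Thu": "08:00-17:00", "Fri": "08:00-17:00", "Sat": "09:00-13:00",
}
profile_hours = {
    "Mon": "08:00-17:00", "Tue": "08:00-17:00", "Wed": "08:00-17:00",
    "Thu": "08:00-17:00", "Fri": "08:00-16:00",
}

for day, (agent, profile) in hours_mismatches(agent_hours, profile_hours).items():
    print(f"{day}: agent says {agent}, profile says {profile}")
```

Any day that prints is a prompt to confirm with the client which source is current before updating the knowledge base.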

Missed Booking Opportunities

The caller said something like "Can I come in Thursday?" or "Do you have any openings this week?" and the agent responded with "I'll pass that along" instead of checking the calendar and booking the slot.

How to fix this: This is usually a conversation flow problem, not a knowledge base problem. Review the agent's booking triggers. The agent should recognize scheduling intent from phrases like "when can I," "do you have availability," "I'd like to schedule," and "can I book." If the calendar integration is active and the agent still deflects, the booking flow needs adjustment. If the integration is not active, connect it.
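During review, a rough phrase match can surface candidate missed bookings before you read the full transcript. The sketch below is a review aid under stated assumptions, not the agent's actual intent detection: the transcript is assumed to be a list of `(speaker, text)` pairs, the phrase lists combine the examples quoted in this article, and the function name `missed_bookings` is made up for illustration.

```python
# Scheduling phrases quoted in this article, lowercased.
SCHEDULING_PHRASES = (
    "when can i", "do you have availability", "i'd like to schedule",
    "can i book", "can i come in", "any openings",
)
# Deflections quoted in this article, lowercased.
DEFLECTION_PHRASES = ("i'll pass that along", "i'll have someone call you back")


def _normalize(text):
    # Lowercase and straighten curly apostrophes so "I’ll" matches "i'll".
    return text.lower().replace("\u2019", "'")


def missed_bookings(turns):
    """Return (caller_text, agent_reply) pairs where scheduling intent met a deflection."""
    missed = []
    for (speaker, text), (next_speaker, reply) in zip(turns, turns[1:]):
        if speaker != "caller" or next_speaker != "agent":
            continue
        asked = any(p in _normalize(text) for p in SCHEDULING_PHRASES)
        deflected = any(p in _normalize(reply) for p in DEFLECTION_PHRASES)
        if asked and deflected:
            missed.append((text, reply))
    return missed


# Hypothetical transcript.
turns = [
    ("agent", "Thanks for calling, how can I help?"),
    ("caller", "Do you have any openings this week?"),
    ("agent", "I'll pass that along to the team."),
    ("caller", "Okay, what are your hours?"),
    ("agent", "We're open nine to five."),
]
for caller_text, agent_reply in missed_bookings(turns):
    print("Missed booking:", caller_text, "->", agent_reply)
```

Substring matching will miss paraphrases and occasionally flag harmless turns, so treat the output as a shortlist for human review rather than a verdict.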

Failure to Escalate Emergencies

The most consequential quality failure. A plumber's emergency call gets the standard "we'll call you back during business hours" treatment. A medical office's urgent patient call gets routed to voicemail.

How to fix this: Define emergency keywords and scenarios for each vertical. Plumbing: "burst pipe," "flooding," "water everywhere," "gas leak." Medical: "chest pain," "can't breathe," "allergic reaction," "emergency." Legal: "arrested," "in custody," "court tomorrow." Program these as high-priority escalation triggers that send immediate SMS notifications to the client. Test these triggers monthly. A single missed emergency call can end a client relationship permanently.
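The keyword lists above translate directly into a lookup table. The following is a minimal sketch for checking transcripts during QA; it does not send SMS or configure any platform, and the names `EMERGENCY_KEYWORDS` and `emergency_hits` are illustrative assumptions.

```python
# Keywords taken from the examples in this article, grouped by vertical.
EMERGENCY_KEYWORDS = {
    "plumbing": ("burst pipe", "flooding", "water everywhere", "gas leak"),
    "medical": ("chest pain", "can't breathe", "allergic reaction", "emergency"),
    "legal": ("arrested", "in custody", "court tomorrow"),
}


def emergency_hits(vertical, caller_text):
    """Return the emergency keywords found in the caller's words for this vertical."""
    text = caller_text.lower().replace("\u2019", "'")
    return [kw for kw in EMERGENCY_KEYWORDS.get(vertical, ()) if kw in text]


def should_escalate(vertical, caller_text):
    return bool(emergency_hits(vertical, caller_text))


print(emergency_hits("plumbing", "There's a burst pipe and water everywhere!"))
print(should_escalate("legal", "I wanted to ask about your fees."))
```

In a QA pass, any transcript where `should_escalate` is true but no urgent notification went out is a candidate for the "failure to escalate" category. Plain keyword matching is deliberately conservative here and will not catch every way a caller might describe an emergency.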

How to Use QA Data to Improve Agents Over Time

QA is not just about catching errors. The transcripts you review contain a map of every conversation your agents handle, and that map reveals exactly where to improve. Agencies that treat QA as a feedback loop, rather than a checkbox, build agents that get measurably better every month.

Knowledge Base Updates

Every QA session should produce a list of knowledge base additions. When callers ask questions the agent cannot answer, that question belongs in the knowledge base. Track the frequency of unanswered questions across all clients in the same vertical. If three different dental clients' agents all struggle with "do you accept my insurance," the answer template belongs in your dental onboarding playbook, not just in one client's knowledge base.

What to do: Maintain a running log of unanswered or poorly answered questions per client. Batch knowledge base updates weekly rather than after every call. Group similar questions and write one comprehensive answer rather than multiple narrow ones. After updating, test the agent with the exact phrasing callers used, not your own paraphrased version.
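One way to spot questions that belong in a vertical playbook rather than a single client's knowledge base is to count how many distinct clients hit the same gap. The sketch below assumes the running log is kept as `(vertical, client, topic)` entries with topics normalized by hand; the function name `shared_gaps` and the sample entries are hypothetical, and the default of three clients mirrors the dental example above.

```python
from collections import defaultdict


def shared_gaps(log, min_clients=3):
    """Return {vertical: [topics]} for topics unanswered at min_clients or more clients."""
    clients_by_topic = defaultdict(set)
    for vertical, client, topic in log:
        # A set counts each client once, however often the question came up.
        clients_by_topic[(vertical, topic)].add(client)

    result = defaultdict(list)
    for (vertical, topic), clients in sorted(clients_by_topic.items()):
        if len(clients) >= min_clients:
            result[vertical].append(topic)
    return dict(result)


# Hypothetical log entries gathered during weekly reviews.
log = [
    ("dental", "client-a", "insurance accepted"),
    ("dental", "client-b", "insurance accepted"),
    ("dental", "client-c", "insurance accepted"),
    ("dental", "client-a", "insurance accepted"),
    ("dental", "client-b", "parking"),
    ("plumbing", "client-d", "weekend rates"),
]
print(shared_gaps(log))
```

Counting distinct clients rather than raw occurrences keeps one busy client from making a local gap look like an industry-wide one.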

Conversation Flow Adjustments

Some issues are not about what the agent knows but about how it navigates the conversation. If callers consistently get stuck in loops, receive redundant questions, or experience abrupt topic changes, the conversation architecture needs adjustment.

Common flow problems and fixes:

Vertical Pattern Libraries

After managing 5-10 clients in the same vertical, you will have enough QA data to build a pattern library: the most common caller questions, the most effective agent responses, and the most frequent failure modes for that industry. This library becomes your competitive advantage. New clients in the same vertical deploy faster and perform better from day one because you have already solved their industry's common problems.

QA at Scale: 10 vs 20 vs 50 Clients

Reviewing every transcript for every client does not scale. The framework above handles individual clients, but agencies need a system for deciding where to spend QA time as the portfolio grows. Operational cadence belongs to your weekly workflow for managing voice agent clients; this section covers QA-specific scaling.

10 Clients: Manual Is Still Viable

At 10 clients with an average of 15-20 calls per client per week, you are reviewing roughly 30-60 sampled transcripts per week (20-30% of 150-200 total). That is 3-5 hours of QA work. Still manageable for a solo operator.

What to prioritize: Maintain the full framework. Daily review for any client in their first 14 days. Weekly sampling for established clients. Monthly audits for all. At this scale, you know each client's agent well enough to spot problems quickly.

20 Clients: Tiered Prioritization

At 20 clients, you are looking at 60-120 sampled transcripts per week. That is 6-10 hours of QA. Still possible solo, but you need to tier your clients.

Tiering strategy:

50 Clients: Automated Flagging Required

At 50 clients, manual review of even 10% of transcripts means reading 75-100 transcripts per week. You need automated flagging to direct your attention.
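The transcript counts quoted for 10, 20, and 50 clients all follow from the same multiplication: clients times calls per client times sampling rate. A short sketch makes the arithmetic reusable for other portfolio sizes. The function name `weekly_review_range` is an assumption; the 15-20 calls per client per week and the sampling percentages come from the text.

```python
def weekly_review_range(clients, calls_per_client=(15, 20), sample_pct=(20, 30)):
    """Return (low, high) transcripts to review per week.

    Percentages are whole numbers so the arithmetic stays exact.
    """
    low = clients * calls_per_client[0] * sample_pct[0] / 100
    high = clients * calls_per_client[1] * sample_pct[1] / 100
    return round(low), round(high)


# 10 and 20 clients at 20-30% sampling, 50 clients at a flat 10%.
for clients, pct in ((10, (20, 30)), (20, (20, 30)), (50, (10, 10))):
    print(clients, "clients:", weekly_review_range(clients, sample_pct=pct))
```

Swapping in your own average call volume shows quickly when manual sampling stops being realistic for your portfolio.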

Automated flagging criteria:

At this scale, your QA process becomes:

  1. Automated flags filter the full transcript volume down to 15-20% that need human review

  2. You review flagged transcripts only (approximately 110-200 per week across 50 clients, based on 15-20% of 750-1,000 weekly calls)

  3. Monthly audits shift to quarterly for Tier 3 clients with perfect automated scores

  4. You hire or contract a QA specialist for transcript review, keeping strategic decisions (knowledge base architecture, conversation flow design) for yourself

Trillet's white-label voice AI platform provides call transcripts, summaries, and analytics dashboards that support this tiered QA workflow. As of June 2026, the Agency plan is $299/month with unlimited sub-accounts.

The Honest Caveat

No QA process catches everything. Voice AI agents will occasionally give a wrong answer, miss a booking cue, or handle an edge case poorly. The goal of QA is not perfection. It is reducing the error rate to a level where the client's callers receive consistently good service and the client sees measurable value from the agent every month. An agent that handles 85-90% of calls well and gracefully escalates the rest is outperforming the voicemail box it replaced. The agencies that struggle with QA are not the ones with imperfect agents. They are the ones with no system for finding and fixing problems before the client does.

Frequently Asked Questions

How long should QA take per client per week?

During the first 14 days, expect 15-20 minutes per day reviewing all transcripts for a client handling 10-15 calls daily. After the initial period, weekly sampling takes 20-30 minutes per client. Monthly audits add 30-45 minutes per client once per month. At 20 clients past the 14-day mark, total weekly QA time is 7-10 hours.

What is the minimum viable QA process for a solo agency?

At minimum, review every transcript for the first 14 days per client, then sample 20% weekly. Score each sampled call on the 5-point scorecard. Push knowledge base fixes within 24 hours of identifying them. Skip any of these three steps and you are guessing about agent quality instead of measuring it.

Should I share QA scores with clients?

Share the outcomes, not the raw scores. Clients want to know their agent handled 95% of calls correctly this month, that you fixed two pricing discrepancies, and that booking rates increased 12% after a conversation flow adjustment. They do not want a spreadsheet of 1-3 ratings for individual calls. The QA process is your internal tool. The results are your client deliverable.

How do I handle a client who reports an issue before I catch it in QA?

Acknowledge it immediately, fix it within hours (not days), and send a brief explanation of what happened and what you changed. Then review your QA process to understand why you missed it. If the issue was in a transcript you sampled but overlooked, you need sharper review criteria. If it was in an unsampled transcript, consider increasing your sampling rate for that client temporarily.

Can QA be fully automated?

Not yet. Automated tools can flag anomalies like dropped calls, long durations, and uncertainty phrases. But determining whether the agent gave a contextually correct answer, matched the client's brand tone, or missed a subtle booking cue still requires human judgment. The best approach as of June 2026 is automated flagging paired with human review of flagged calls.
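The three anomaly types named above (dropped calls, long durations, and uncertainty phrases) are simple enough to flag with a few rules. The sketch below is one possible shape for such a filter, with loud caveats: the call fields (`dropped`, `duration_s`, `agent_text`), the 300-second threshold, and the specific uncertainty phrases are all hypothetical placeholders, not values from this playbook or from any particular platform.

```python
# Hypothetical placeholders: tune both to your own call data.
LONG_CALL_SECONDS = 300
UNCERTAINTY_PHRASES = ("i'm not sure", "i don't know", "i don't have that information")


def flag_reasons(call):
    """Return the reasons a call should go to human review (empty list means no flag).

    `call` is a dict with assumed keys: 'dropped' (bool), 'duration_s' (number),
    and 'agent_text' (the agent's side of the transcript).
    """
    reasons = []
    if call.get("dropped"):
        reasons.append("dropped call")
    if call.get("duration_s", 0) > LONG_CALL_SECONDS:
        reasons.append("long duration")
    text = call.get("agent_text", "").lower().replace("\u2019", "'")
    if any(phrase in text for phrase in UNCERTAINTY_PHRASES):
        reasons.append("uncertainty phrase")
    return reasons


# Hypothetical calls.
calls = [
    {"id": "c1", "dropped": False, "duration_s": 95, "agent_text": "You're booked for Thursday at 2pm."},
    {"id": "c2", "dropped": True, "duration_s": 12, "agent_text": "Hello, thanks for calling."},
    {"id": "c3", "dropped": False, "duration_s": 420, "agent_text": "I'm not sure about that price."},
]
review_queue = {c["id"]: flag_reasons(c) for c in calls if flag_reasons(c)}
print(review_queue)
```

Rules like these only decide which calls a person reads. Whether the answer was contextually correct still has to be judged by the reviewer.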
