Choosing a transcription vendor without a pilot is risky. A good pilot shows how the vendor handles your real audio, your accuracy needs, and your turnaround requirements before you commit.
The best way to pilot a transcription vendor is to test a small set of real recordings, use clear acceptance criteria, score each output the same way, and make a go/no-go decision based on evidence. This guide gives you a practical plan, a test script, and a scoring sheet you can use right away.
Key takeaways
- Use real recordings from your workflow, not only clean sample audio.
- Define acceptance criteria before the test starts.
- Check more than word accuracy, including names, numbers, speaker labels, formatting, and turnaround time.
- Score every vendor with the same checklist.
- Make a go/no-go decision based on must-pass items and total score.
Why a transcription vendor pilot matters
Many transcription vendors look similar until you test them on difficult audio. A pilot helps you see how they handle crosstalk, accents, technical terms, low volume, and speaker changes.
It also helps you avoid two common mistakes: choosing on price alone and choosing based on a polished demo. Your actual recordings are the only test that matters.
A structured pilot gives you a fair way to compare options. It also gives your team a shared decision process, which is useful when legal, research, media, or operations teams all care about different details.
Step 1: Choose the right test recordings
Your pilot should be small enough to manage and broad enough to reflect your real work. For most teams, 5 to 10 recordings is enough for a first pass.
What to include in the test set
- One clean recording with clear speakers
- One moderate-difficulty recording with minor background noise
- One difficult recording with overlapping speech or crosstalk
- One recording with names, product terms, or jargon
- One recording with important numbers, dates, amounts, or IDs
- Optional: one recording with accents, remote call audio, or multiple speakers
How long each recording should be
Keep each file short enough to review without creating a heavy burden. A practical range is 5 to 15 minutes per file.
That gives you enough material to test quality while keeping scoring consistent. If your workflow depends on long recordings, include one longer file as a separate stress test.
How to prepare the files
- Use the same source files for every vendor.
- Do not edit the audio to make it easier.
- Label files clearly, such as File A, File B, and File C.
- Remove sensitive data if needed, or confirm the vendor can handle it under your rules.
- Provide the same instructions, style guide, and glossary to all vendors.
What to send with the files
- Expected turnaround time
- Required output format, such as DOCX, TXT, or SRT
- Speaker label rules
- Formatting rules for dates, times, and numbers
- Any glossary of names, products, or technical terms
- Notes on sections that must be verbatim
If you need captions or subtitles as part of the workflow, test that separately with the same discipline. GoTranscript also offers closed caption services for teams that need timed text as part of delivery.
Step 2: Define acceptance criteria before the pilot starts
Acceptance criteria are the pass or fail rules for the pilot. Write them down before you see any results so you do not move the goalposts later.
Core quality criteria to include
- Overall transcript accuracy
- Accuracy for names and proper nouns
- Accuracy for numbers, dates, amounts, and identifiers
- Speaker diarization accuracy
- Formatting consistency
- Turnaround time
- Instruction follow-through
Suggested acceptance criteria
Set thresholds that match your use case. If names and numbers matter a lot in your workflow, make them must-pass items even if the rest of the transcript looks good.
- Names and proper nouns: zero critical errors in approved sample sections
- Numbers, dates, and amounts: zero critical errors in approved sample sections
- Speaker diarization: at least 90% of speaker changes labeled correctly
- Formatting and style: at least 95% compliance with the provided style guide
- Turnaround time: delivered within the agreed pilot deadline
- Completeness: no missing sections, skipped speech, or unexplained blanks beyond agreed rules
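As a rough illustration, the thresholds above could be written down as an automated check before any results arrive. This is a minimal sketch, not a prescribed tool: the `PilotResult` structure, the field names, and the `failed_criteria` function are hypothetical, and the 90% and 95% defaults simply mirror the suggested values in the list.

```python
from dataclasses import dataclass


@dataclass
class PilotResult:
    """Reviewer findings for one vendor (hypothetical structure)."""
    critical_name_errors: int
    critical_number_errors: int
    diarization_accuracy: float  # share of speaker changes labeled correctly, 0.0 to 1.0
    style_compliance: float      # share of style rules followed, 0.0 to 1.0
    delivered_on_time: bool
    complete: bool               # no missing sections or unexplained blanks


def failed_criteria(result, min_diarization=0.90, min_style=0.95):
    """Return the names of the acceptance criteria the vendor did not meet."""
    failures = []
    if result.critical_name_errors > 0:
        failures.append("names")
    if result.critical_number_errors > 0:
        failures.append("numbers")
    if result.diarization_accuracy < min_diarization:
        failures.append("diarization")
    if result.style_compliance < min_style:
        failures.append("style")
    if not result.delivered_on_time:
        failures.append("turnaround")
    if not result.complete:
        failures.append("completeness")
    return failures


# Invented example: one critical number error, everything else within limits.
example = PilotResult(0, 1, 0.92, 0.97, True, True)
print(failed_criteria(example))  # ['numbers']
```

An empty list means every criterion was met. Writing the thresholds as function defaults makes it harder to adjust them quietly after results come in.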
Define what counts as a critical error
Not all errors carry the same weight. A missing comma is different from a wrong medication dose, wrong legal name, or wrong dollar amount.
- Critical error: changes meaning, identity, amount, date, or who said it
- Major error: clear mistake that affects usability but not core meaning
- Minor error: punctuation, style, or formatting issue that does not change meaning
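One way to keep these three levels consistent across reviewers is to log each error with an explicit severity and tally the counts afterward. The sketch below assumes a simple list of (description, severity) pairs. The `Severity` enum, the `tally_by_severity` function, and the sample errors are illustrative inventions, not pilot data.

```python
from collections import Counter
from enum import Enum


class Severity(Enum):
    CRITICAL = "critical"  # changes meaning, identity, amount, date, or who said it
    MAJOR = "major"        # affects usability but not core meaning
    MINOR = "minor"        # punctuation, style, or formatting only


def tally_by_severity(error_log):
    """Count logged errors per severity level, including levels with zero errors."""
    counts = Counter(severity for _description, severity in error_log)
    return {severity.value: counts[severity] for severity in Severity}


# Invented review log for one file.
error_log = [
    ("Dose transcribed as 15 mg instead of 50 mg", Severity.CRITICAL),
    ("Sentence attributed to the wrong speaker", Severity.CRITICAL),
    ("Filler phrase dropped in a verbatim section", Severity.MAJOR),
    ("Missing comma", Severity.MINOR),
]
print(tally_by_severity(error_log))  # {'critical': 2, 'major': 1, 'minor': 1}
```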
If you work in regulated environments, align your review rules with internal policy. If accessibility is part of the scope, review relevant captioning and transcript requirements from the WCAG guidance when you define deliverables.
Step 3: Run the pilot with a standard test script and checklist
A pilot works best when each vendor gets the same brief. Keep instructions simple, specific, and identical across vendors.
Sample pilot test script
- Project: Pilot transcription test for vendor evaluation
- Files included: 5 sample recordings labeled A to E
- Audio type: mix of clean, moderate, and difficult recordings
- Output required: verbatim transcript in DOCX and TXT
- Speaker labels: identify each speaker change where possible
- Numbers: write dates, amounts, and identifiers exactly as spoken unless the style guide says otherwise
- Names and terms: use the attached glossary for proper nouns and technical terms
- Unclear audio: mark unintelligible sections using the agreed tag format
- Deadline: submit all files by [date and time]
- Questions: submit any clarification questions before [date and time]
Pilot execution checklist
- Choose 5 to 10 recordings from real workflows
- Create a short style guide and glossary
- Define acceptance criteria and scoring rules
- Send the same files and instructions to all vendors
- Track response time and any clarification questions
- Review delivery format, completeness, and deadline compliance
- Score quality using the same reviewer or the same review team
- Document issues by file and by error type
Review process tips
Use a reference transcript or a reviewer-approved answer key for the scored sections. You do not need to score every second of every file if time is tight, but you should score the same parts for every vendor.
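If you have a reference transcript, a word-level comparison can give a first rough number for overall accuracy. The sketch below computes word error rate, a common metric that this guide does not prescribe, using a standard edit distance over words. The function name is hypothetical, and the comparison is deliberately naive: it lowercases the text but does not strip punctuation or normalize numbers, so treat the result as a starting point for manual review rather than a final score.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of five reference words.
print(word_error_rate("the dose is five milligrams",
                      "the dose is nine milligrams"))  # 0.2
```

A low word error rate can still hide a critical error, as in the example above, so it complements the name and number checks rather than replacing them.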
Blind review helps if you want to reduce bias. Replace vendor names with neutral labels like Vendor 1, Vendor 2, and Vendor 3 during scoring.
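A small script can assign the neutral labels so that nobody on the review team chooses the mapping by hand. This is a sketch under simple assumptions: the `blind_labels` function and the vendor names are hypothetical, and the returned answer key would need to be stored somewhere reviewers cannot see until scoring is finished.

```python
import random


def blind_labels(vendor_names, seed=None):
    """Map neutral labels ("Vendor 1", "Vendor 2", ...) to shuffled vendor names."""
    rng = random.Random(seed)  # a seed makes the assignment reproducible
    shuffled = list(vendor_names)
    rng.shuffle(shuffled)
    return {f"Vendor {i}": name for i, name in enumerate(shuffled, start=1)}


# Invented vendor names. Keep this answer key away from reviewers.
answer_key = blind_labels(["Acme Transcripts", "Beta Audio", "Gamma Text"])
for label in answer_key:
    print(label)  # reviewers see only the neutral labels
```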
Step 4: Use a simple scoring sheet to compare vendors fairly
A scoring sheet turns feedback into a decision tool. Keep it simple enough that reviewers will use it the same way every time.
Sample scoring categories
- Names and proper nouns: 25 points
- Numbers, dates, and amounts: 25 points
- Speaker diarization: 15 points
- Overall transcript accuracy: 15 points
- Formatting and style compliance: 10 points
- Turnaround time: 5 points
- Communication and issue handling: 5 points
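The weights above add up to 100, so a total score can be computed as a weighted sum. The sketch below assumes each category is first rated on a 0.0 to 1.0 scale by the reviewer. That rating scale, the dictionary keys, and the `total_score` function are assumptions made for illustration, since the guide leaves the method for awarding points within each category up to you.

```python
# Maximum points per category, taken from the sample scoring categories.
WEIGHTS = {
    "names": 25,
    "numbers": 25,
    "diarization": 15,
    "overall_accuracy": 15,
    "style": 10,
    "turnaround": 5,
    "communication": 5,
}


def total_score(ratings):
    """Combine per-category ratings (0.0 to 1.0) into a score out of 100."""
    missing = set(WEIGHTS) - set(ratings)
    if missing:
        raise ValueError(f"missing ratings for: {sorted(missing)}")
    for category in WEIGHTS:
        if not 0.0 <= ratings[category] <= 1.0:
            raise ValueError(f"rating for {category} must be between 0.0 and 1.0")
    return round(sum(WEIGHTS[c] * ratings[c] for c in WEIGHTS), 2)


# Invented ratings for one vendor.
ratings = {
    "names": 1.0,
    "numbers": 0.8,
    "diarization": 0.9,
    "overall_accuracy": 1.0,
    "style": 0.5,
    "turnaround": 1.0,
    "communication": 1.0,
}
print(total_score(ratings))
```

Because names and numbers carry half of the available points, weak performance there pulls the total down quickly, which matches the emphasis on those items as must-pass criteria.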
Sample scoring sheet
- Vendor name:
- Files reviewed:
- Delivered on time: Yes or No
- Required format delivered: Yes or No
- Critical errors in names: [count]
- Critical errors in numbers: [count]
- Diarization accuracy: [percent]
- Style compliance: [percent]
- Unclear tags used correctly: Yes or No
- Total score out of 100:
- Must-pass criteria met: Yes or No
- Reviewer notes:
How to score diarization accuracy
Count speaker changes in the reviewed section, then count how many were labeled correctly. Divide correct labels by total speaker-change events and multiply by 100 to get a percentage.
For example, if a file section has 20 speaker changes and the vendor labels 18 correctly, diarization accuracy is 90% for that section.
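The same calculation can be written as a short helper so every reviewer computes it the same way. This is a minimal sketch, and the function name and input checks are illustrative choices.

```python
def diarization_accuracy(correct_labels, total_changes):
    """Percentage of speaker-change events that were labeled correctly."""
    if total_changes <= 0:
        raise ValueError("the reviewed section must contain at least one speaker change")
    if not 0 <= correct_labels <= total_changes:
        raise ValueError("correct_labels must be between 0 and total_changes")
    return 100 * correct_labels / total_changes


# The example from the text: 18 of 20 speaker changes labeled correctly.
print(diarization_accuracy(18, 20))  # 90.0
```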
How to compare automated and human workflows
If you are comparing a human vendor with an AI-first option, score both against the same checklist. That keeps the review focused on outcomes instead of process.
If speed is the main goal for some content, you may also want to test automated transcription as a separate lane. Just do not mix acceptance criteria for high-risk files with criteria for low-risk files.
Step 5: Make a go or no-go decision
Your final decision should combine must-pass criteria with total score. A vendor with a high total score should still fail if they miss the items that matter most to your use case.
Recommended decision rules
- Go: vendor meets all must-pass criteria and reaches your target total score
- Conditional go: vendor misses only minor items and has a clear correction plan
- No-go: vendor misses any critical must-pass item or shows repeated error patterns
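These rules can be expressed as a small decision function. The sketch below makes two assumptions that go beyond the rules as written: "misses only minor items" is read as meeting every must-pass criterion while falling short of the target total score, and a vendor in that position without a correction plan is treated as a no-go. The function and parameter names are hypothetical.

```python
def decide(must_pass_met, total_score, target_score,
           repeated_error_patterns=False, has_correction_plan=False):
    """Return "Go", "Conditional go", or "No-go" for one vendor."""
    # A must-pass failure or a repeated error pattern overrides any total score.
    if not must_pass_met or repeated_error_patterns:
        return "No-go"
    if total_score >= target_score:
        return "Go"
    # Assumption: below target with must-pass items met counts as minor misses.
    if has_correction_plan:
        return "Conditional go"
    return "No-go"


print(decide(True, 88.5, 85))                             # Go
print(decide(True, 80, 85, has_correction_plan=True))     # Conditional go
print(decide(False, 95, 85))                              # No-go
```

The third call shows the key property of the rules: a high total score does not rescue a vendor that fails a must-pass item.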
Questions to ask before final approval
- Did the vendor handle difficult audio well enough for your real workload?
- Were names and numbers accurate enough for your use case?
- Was speaker labeling reliable enough for interviews, meetings, or research?
- Did they follow instructions without repeated reminders?
- Did they deliver on time and in the right format?
- Was communication clear when something in the audio was hard to hear?
Common pilot pitfalls
- Testing only clean audio
- Using vague success criteria like "accurate enough"
- Reviewing different sections for different vendors
- Ignoring diarization because word accuracy looks fine
- Failing to separate critical errors from minor style issues
- Changing thresholds after results come in
If two vendors score closely, run a second-round pilot with harder files or a larger sample. That is often better than choosing based on small score differences.
Practical pilot template you can copy
1. Pilot scope
- Number of vendors: [insert number]
- Number of files: [insert number]
- File length range: [insert range]
- Audio types: clean, moderate, difficult, jargon-heavy, number-heavy
- Deadline: [insert date]
2. Must-pass acceptance criteria
- No critical errors in names in reviewed sections
- No critical errors in numbers, dates, or amounts in reviewed sections
- Diarization accuracy of at least [insert threshold]
- Delivered by deadline
- Required format and style guide followed
3. Score threshold
- Minimum total score to pass: [insert score] out of 100
4. Review method
- Reviewer names: [insert names]
- Blind review: Yes or No
- Reference transcript available: Yes or No
- Sections scored: [insert timestamps]
5. Final decision
- Vendor selected: [insert name]
- Decision: Go, Conditional go, or No-go
- Reason: [insert short summary]
Common questions
How many files should I include in a transcription vendor pilot?
Most teams can start with 5 to 10 files. Use enough variety to reflect your real work, especially difficult audio and files with names or numbers.
Should I use real recordings or sample audio?
Use real recordings whenever possible. Sample audio is often too clean and may not show how the vendor performs under normal conditions.
What is the most important acceptance criterion?
That depends on your workflow, but names, numbers, and speaker diarization are often the most important because a single error there can change the meaning of a transcript.
How do I test speaker diarization?
Choose sections with clear speaker changes, count the change events, and measure how many labels are correct. Review the same sections for every vendor.
Can I compare AI transcription and human transcription in the same pilot?
Yes, if you use the same files and scoring rules. If your content has different risk levels, set separate acceptance criteria for low-risk and high-risk work.
What should trigger a no-go decision?
Repeated critical errors, missed deadlines, poor speaker labeling, or failure to follow instructions are common no-go triggers. A high overall score should not override a must-pass failure.
What if two vendors score almost the same?
Run a second-round pilot with harder files or a larger sample. That usually gives a clearer answer than debating a small score gap.
If you want a more reliable way to evaluate transcript quality before scaling up, GoTranscript provides the right solutions, from pilot-friendly workflows to professional transcription services that fit different accuracy and turnaround needs.