How to Transcribe Audio and Video on Linux (Ubuntu): A Step-by-Step Guide

Matthew Patel

Posted in Zoom Dec 17 · 19 Dec, 2025

To transcribe audio and video on Ubuntu, you first need a clean audio track, then you choose a transcription method: built-in speech-to-text, a Linux desktop app, or an online transcription service. This guide walks you through recording or extracting audio with OBS and ffmpeg, converting files, improving audio quality, and finishing with a simple upload workflow for human transcription or captions.

Key takeaways

Start with good audio: correct mic selection, stable levels, and minimal noise beat any “magic” transcription tool.
Use ffmpeg to extract audio from video and convert to WAV/FLAC at a speech-friendly sample rate (16 kHz to 48 kHz).
Pick a transcription approach based on your needs: speed (automated), privacy/offline (local), or accuracy and formatting (human).
When you upload for human transcription or captions, request speaker labels, timestamps, and “verbatim” vs “clean read” up front.

Step 1: Record or extract the audio (Ubuntu-friendly options)

If you already have a video file, the fastest path is to extract a separate audio file before you transcribe. If you still need to record, record with settings that make speech clear and consistent.

Option A: Record on Ubuntu with OBS Studio

OBS works well for screen recordings, webinars, and interviews because it lets you choose the exact audio input and monitor levels. If you use OBS, do a short test recording, listen back, and only then record the full session.

Select the right mic: In OBS, set your Mic/Aux input to your real microphone (not “Monitor of…” unless you intend that).
Watch levels: Aim for speech peaks between about -12 dB and -6 dB on the OBS meter, which leaves headroom and avoids clipping.
Record in a predictable format: If OBS produces a video container (like MKV/MP4), you can later extract audio with ffmpeg.

Option B: Extract audio from a video with ffmpeg

On Ubuntu, ffmpeg is the workhorse for pulling audio out of MP4/MKV recordings and converting it into a transcription-friendly file. Install it if needed, then extract or convert with the examples below.

Install: sudo apt update && sudo apt install ffmpeg
Check what streams are inside a file: ffprobe input.mp4

Extract audio without re-encoding (fast, if compatible):

ffmpeg -i input.mp4 -vn -acodec copy output.aac

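This keeps the original audio codec, so it is quick, but the output extension has to match that codec: output.aac only works when the source audio is AAC, so check with ffprobe first. A compressed copy is also not always ideal for editing or cleanup.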
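If you are not sure which codec a video carries, a small helper can pick the output extension for you. The sketch below is an illustration rather than a complete tool: the function names are made up for this example, the codec-to-extension table covers only a few common cases, and anything unknown falls back to Matroska audio (.mka).

```shell
# Map an audio codec name (as reported by ffprobe) to a file extension that
# can hold it without re-encoding. Illustrative and not exhaustive; unknown
# codecs fall back to Matroska audio (.mka).
ext_for_codec() {
  case "$1" in
    aac)       echo m4a ;;
    mp3)       echo mp3 ;;
    opus)      echo opus ;;
    vorbis)    echo ogg ;;
    flac)      echo flac ;;
    pcm_s16le) echo wav ;;
    *)         echo mka ;;
  esac
}

# Copy the first audio stream of a video ($1) into a matching container.
# Intended for video inputs, so the output name differs from the input name.
extract_audio_copy() (
  in=$1
  codec=$(ffprobe -v error -select_streams a:0 \
    -show_entries stream=codec_name \
    -of default=noprint_wrappers=1:nokey=1 "$in")
  out="${in%.*}.$(ext_for_codec "$codec")"
  ffmpeg -i "$in" -vn -map 0:a:0 -c:a copy "$out"
)

# Usage: extract_audio_copy input.mp4
```

If the copy fails or the result will be edited, converting to WAV or FLAC is the safer route.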

Convert video to WAV for transcription (safe, editing-friendly):

ffmpeg -i input.mp4 -vn -ac 1 -ar 16000 -c:a pcm_s16le output.wav

This creates a mono 16 kHz 16-bit PCM WAV, which many workflows handle well for speech.

Convert video to FLAC (smaller than WAV, still lossless):

ffmpeg -i input.mp4 -vn -ac 1 -ar 48000 -c:a flac output.flac

Step 2: Use recommended audio settings (and why they help)

You can transcribe almost any reasonable file, but certain settings make speech easier to process and review. The goal is clear voice, stable volume, and minimal background noise.

Recommended settings for speech transcription

Channels: Mono (-ac 1) unless you need stereo to separate speakers.
Sample rate: 16 kHz to 48 kHz (-ar 16000 or -ar 48000).
Format: WAV (PCM) or FLAC for clean editing and fewer surprises.
Mic placement: 6–12 inches from the speaker, off-axis to reduce plosives.
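To check whether an existing file already matches the channel and sample-rate recommendations above, you can read its properties with ffprobe and compare them. This is a minimal sketch; the function names are hypothetical, and "not-mono" is only a flag to review, since stereo is fine when it separates speakers.

```shell
# Print sample_rate=... and channels=... for the first audio stream of $1.
probe_props() {
  ffprobe -v error -select_streams a:0 \
    -show_entries stream=sample_rate,channels \
    -of default=noprint_wrappers=1 "$1"
}

# Read key=value lines on stdin and print "ok", or a list of flags when the
# stream is not mono or its sample rate is outside 16 kHz to 48 kHz.
check_speech_props() {
  awk -F= '
    $1 == "sample_rate" { rate = $2 + 0 }
    $1 == "channels"    { ch = $2 + 0 }
    END {
      msg = ""
      if (ch != 1) msg = "not-mono"
      if (rate < 16000 || rate > 48000)
        msg = (msg == "" ? "" : msg " ") "rate-out-of-range"
      print (msg == "" ? "ok" : msg)
    }'
}

# Usage: probe_props output.wav | check_speech_props
```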

Noise reduction basics (keep it simple)

Noise reduction helps, but overdoing it can distort speech and make transcription harder. Try to reduce noise at the source first, then apply light processing.

At the source: Close windows, turn off fans if possible, and use a consistent mic distance.
Use a pop filter: It reduces “p” and “b” bursts that can clip.
Prefer gentle cleanup: Remove constant hums, then normalize volume, then stop.

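Optional: Normalize volume with ffmpeg (simple, and non-destructive because it writes a new file):

ffmpeg -i input.wav -af loudnorm output_loud.wav

This can make quiet speakers easier to hear, but always listen to a short segment after processing. The loudnorm filter resamples internally and can write the output at a higher sample rate than the input, so add -ar 16000 (or your target rate) after the filter if you need a fixed rate.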

Step 3: Convert formats with ffmpeg (copy-paste command examples)

File conversion is often the difference between a smooth transcription workflow and hours of troubleshooting. These commands cover the most common “I have X, but I need Y” situations on Ubuntu.

Convert common audio formats to WAV or FLAC

MP3 to WAV (mono, 16 kHz):
ffmpeg -i input.mp3 -ac 1 -ar 16000 -c:a pcm_s16le output.wav
M4A/AAC to FLAC (mono, 48 kHz):
ffmpeg -i input.m4a -ac 1 -ar 48000 -c:a flac output.flac
WAV to FLAC (lossless compression):
ffmpeg -i input.wav -c:a flac output.flac
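When you have a folder of recordings rather than a single file, the same conversion can run in a loop. The sketch below assumes you want mono 16 kHz WAV files written next to the originals; the function names are illustrative.

```shell
# Build the output name: same path, extension replaced with .wav.
wav_name_for() {
  printf '%s\n' "${1%.*}.wav"
}

# Convert every file passed as an argument to mono 16 kHz 16-bit PCM WAV.
# -n makes ffmpeg refuse to overwrite an existing output file, and -nostdin
# stops it from reading the terminal while the loop runs.
convert_all_to_wav() {
  for f in "$@"; do
    ffmpeg -nostdin -n -i "$f" -vn -ac 1 -ar 16000 -c:a pcm_s16le \
      "$(wav_name_for "$f")"
  done
}

# Usage: convert_all_to_wav recordings/*.mp3 recordings/*.m4a
```

Because of -n, files that are already WAV are skipped with an error message instead of being overwritten.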

Extract a specific audio track from a video

If a file has multiple audio streams (like multiple languages), you can pick one by index. Use ffprobe first, then map the stream you want.

Example: extract the second audio stream (index 1, because audio stream indexes start at 0) to WAV
ffmpeg -i input.mkv -map 0:a:1 -vn -ac 1 -ar 16000 -c:a pcm_s16le output.wav
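Since ffmpeg counts audio streams from zero, it is easy to grab the wrong track. As a rough sketch (the helper names are invented for this example), you could list only the audio streams and translate a human "track number" into the matching -map value:

```shell
# List audio streams only, one per row: overall stream index, codec, and
# language tag if present. Rows appear in audio order, so the first row is
# 0:a:0, the second is 0:a:1, and so on.
list_audio_streams() {
  ffprobe -v error -select_streams a \
    -show_entries stream=index,codec_name:stream_tags=language \
    -of csv=p=0 "$1"
}

# Convert a 1-based track number into ffmpeg's 0-based -map specifier.
audio_map_for_track() {
  echo "0:a:$(( $1 - 1 ))"
}

# Usage:
#   list_audio_streams input.mkv
#   ffmpeg -i input.mkv -map "$(audio_map_for_track 2)" -vn -ac 1 -ar 16000 \
#     -c:a pcm_s16le output.wav
```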

Trim the audio (useful before you upload)

Trimming helps you remove dead air, false starts, or private sections you do not want transcribed. Always verify the cut point by listening.

Trim from 00:02:00 to 00:12:30:
ffmpeg -i input.wav -ss 00:02:00 -to 00:12:30 -c copy output_trim.wav
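Listening is the real check, but a quick sanity test is to compare the length you expected with the length ffprobe reports for the trimmed file. A small sketch, with hypothetical helper names:

```shell
# Convert HH:MM:SS to whole seconds.
hms_to_seconds() {
  echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# Expected length in seconds of a cut from $1 to $2 (both HH:MM:SS).
trim_length() {
  echo $(( $(hms_to_seconds "$2") - $(hms_to_seconds "$1") ))
}

# Actual duration in seconds of a media file, as reported by ffprobe.
file_duration() {
  ffprobe -v error -show_entries format=duration \
    -of default=noprint_wrappers=1:nokey=1 "$1"
}

# Usage:
#   trim_length 00:02:00 00:12:30     # expected seconds
#   file_duration output_trim.wav     # should be close to the value above
```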

Step 4: Choose a transcription approach on Ubuntu (local, app, or service)

The best approach depends on your priorities: accuracy, speed, budget, privacy, and formatting needs. On Linux, you can mix approaches, like doing a quick automated draft and then sending the final file for human transcription.

Approach 1: Built-in or browser speech-to-text (quick drafts)

This works best for short clips, clear audio, and situations where you only need rough notes. It often struggles with multiple speakers, accents, cross-talk, and technical terms.

Best for: personal notes, brainstorming, first-pass summaries.
Watch out for: privacy (cloud processing), missing punctuation, and speaker confusion.

Approach 2: Desktop apps or local models (more control, offline possible)

Local transcription can be a good fit when you need offline processing or repeatable workflows. You still need time to review and fix the output, especially for names and jargon.

Best for: teams with technical staff, sensitive audio that must stay local, and repeatable batch jobs.
Watch out for: setup time, model downloads, and CPU/GPU requirements.

Approach 3: Upload to a transcription service (best for accuracy and formatting)

If you need a clean, publish-ready transcript, human transcription usually saves time because you get consistent speaker labels, punctuation, and formatting. This is also the simplest path when you need captions or subtitles, not just plain text.

Best for: interviews, podcasts, legal or research recordings, and client-facing content.
Watch out for: you may need to redact sensitive data before upload, depending on your policies.

If you want an automated option as a first pass, you can also compare with automated transcription and decide whether you still need a human final.

Step 5: Troubleshooting on Ubuntu (permissions, codecs, mic, low volume)

Most transcription problems on Linux come from the audio pipeline, not the transcription itself. Fix the source issues first so you do not waste time “correcting” a transcript that never had a chance.

Problem: The browser or app can’t access my microphone

Check OS permissions: Ubuntu Settings → Privacy → Microphone, then allow access for the app.
Check browser permissions: Ensure the site has microphone permission in your browser settings.
Check device selection: Open Ubuntu Sound settings and confirm the correct input device is selected.

Problem: My recording has no sound, or the wrong mic recorded

In OBS: Confirm Mic/Aux points to the intended microphone and is not muted.
In system settings: Settings → Sound → Input, then speak and watch the input meter.
Test before a long recording: Record 10 seconds and listen back every time you change gear.

Problem: “Codec not supported” or ffmpeg errors

Inspect the file: Run ffprobe input.ext to see the codec details.
Transcode to a safe format: Convert to WAV or FLAC using the commands above.
Update ffmpeg: If you are on an older Ubuntu release, your ffmpeg may lack some decoders.
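Before upgrading anything, you can check whether your ffmpeg build lists the decoder that ffprobe named. This sketch assumes the usual `ffmpeg -decoders` layout, where the decoder name is the second column; the function name is illustrative.

```shell
# Read the output of `ffmpeg -decoders` on stdin and succeed (exit 0) if the
# named decoder appears in the second column.
decoder_listed() {
  awk -v name="$1" '$2 == name { found = 1 } END { exit !found }'
}

# Usage:
#   if ffmpeg -hide_banner -decoders | decoder_listed opus; then
#     echo "opus decoder available"
#   else
#     echo "opus decoder missing from this ffmpeg build"
#   fi
```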

Problem: The audio volume is too low

Raise input gain: Increase mic input level in Settings → Sound → Input.
Move closer: Halving the distance to the mic can help more than post-processing.
Normalize carefully: Try ffmpeg -i input.wav -af loudnorm output_loud.wav and listen for artifacts.
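To judge whether a file is actually too quiet before you normalize it, ffmpeg's volumedetect filter can report the peak level. The sketch below uses made-up function names and treats a peak below -12 dB as "quiet", a threshold borrowed from the recording range suggested earlier rather than a fixed rule.

```shell
# Pull the max_volume value (in dB) out of volumedetect's log text on stdin.
parse_max_volume() {
  sed -n 's/.*max_volume: \(-\{0,1\}[0-9.]*\) dB.*/\1/p'
}

# Measure the peak level of a file; volumedetect logs to stderr.
measure_peak() {
  ffmpeg -hide_banner -nostdin -i "$1" -af volumedetect -f null - 2>&1 |
    parse_max_volume
}

# Succeed if peak $1 (dB) is below the limit $2 (default -12 dB).
is_quiet() {
  awk -v peak="$1" -v limit="${2:--12}" \
    'BEGIN { exit !(peak + 0 < limit + 0) }'
}

# Usage:
#   peak=$(measure_peak input.wav)
#   is_quiet "$peak" && echo "Peak is $peak dB: consider normalizing"
```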

Problem: Background noise or echo ruins accuracy

Reduce echo: Record in a smaller room, add soft furnishings, and avoid laptop mics when possible.
Separate speakers: If you can, record each speaker to their own mic or track.
Avoid aggressive noise reduction: If voices sound “watery” after cleanup, back off.

Step 6: Upload to GoTranscript for human transcription or captions (simple workflow)

If you want a polished transcript with speaker labels, timestamps, and clean formatting, the simplest workflow is: prepare a clean audio file, upload it, then specify exactly how you want the transcript formatted. You can use the same starting file to order captions for video.

What file formats should you upload?

In practice, you will have the smoothest experience if you upload a common audio/video format, such as WAV, FLAC, MP3, M4A, MP4, or MKV. If you are unsure, convert to WAV or FLAC with ffmpeg, then upload that file.

What to include in your order notes (speaker labels, timestamps, style)

Speaker labels: Ask for “Speaker 1, Speaker 2” if you do not know names, or provide names if you do.
Timestamps: Request timestamps at your preferred interval (for example, every 30 seconds or at speaker changes).
Verbatim vs clean read: Choose verbatim if you need every false start and filler word, or clean read for readability.
Proper nouns and jargon: Provide a short glossary (names, acronyms, product terms) to reduce ambiguity.
Multiple files: If the audio spans parts, label files clearly (Part_01, Part_02) and note if they share speakers.

Human transcript vs captions: how to decide

Choose a transcript when you need text for editing, quotes, notes, or publishing as an article.
Choose captions when you need timed text synchronized to video for accessibility and viewing without sound.

If you need video captions specifically, see closed caption services for caption-ready outputs.

Upload workflow (Ubuntu-friendly)

1) Prep the file: Extract and convert to WAV/FLAC if needed, and trim private sections.
2) Spot-check audio: Listen to 30–60 seconds from the start, middle, and end.
3) Upload: Use the GoTranscript order page: Order transcription.
4) Choose options: Select transcript type and add instructions for speaker labels, timestamps, and verbatim vs clean read.
5) Keep a reference: Save your original file name and any glossary you provided for later review.
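The spot-check in step 2 is easier if you cut three short clips instead of scrubbing through a long file. One possible sketch, assuming 30-second clips and using invented helper names:

```shell
# Print three clip start offsets (start, middle, end) in whole seconds for a
# recording of duration $1 and clip length $2, both in seconds.
spot_offsets() {
  awk -v d="$1" -v n="$2" 'BEGIN {
    e = int(d - n); if (e < 0) e = 0
    print 0, int(e / 2), e
  }'
}

# Cut three short WAV clips from $1 for a quick listen (default 30 s each).
# -n stops ffmpeg from overwriting clips, so very short files that produce
# repeated offsets yield a single clip plus error messages.
make_spot_clips() (
  in=$1; len=${2:-30}
  dur=$(ffprobe -v error -show_entries format=duration \
    -of default=noprint_wrappers=1:nokey=1 "$in")
  for off in $(spot_offsets "$dur" "$len"); do
    ffmpeg -hide_banner -nostdin -n -ss "$off" -t "$len" -i "$in" \
      "spot_${off}s.wav"
  done
)

# Usage: make_spot_clips output.wav 30
```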

Common questions

Should I upload the video file or an extracted audio file?
Either works, but an extracted WAV/FLAC often uploads faster and avoids video codec issues while keeping speech clear.
Is mono or stereo better for transcription?
Mono is usually best for speech because it is simpler and smaller, but stereo can help if each speaker is on a different channel.
What sample rate should I use: 16 kHz or 48 kHz?
16 kHz is common for speech-focused workflows, while 48 kHz is common for video production; both can work well if the audio is clean.
How do I handle multiple speakers on one mic?
Ask speakers to take turns, reduce cross-talk, and keep a steady mic distance, then request speaker labels when ordering.
Why does ffmpeg create a file with no sound?
The input may have multiple audio streams, or you may have mapped the wrong stream; run ffprobe and use -map 0:a:0 (or another index).
What’s the difference between verbatim and clean read?
Verbatim keeps filler words and false starts, while clean read removes most fillers and tidies the text for easier reading.

If you want a reliable, publish-ready transcript or caption file without managing tools and cleanup yourself, GoTranscript offers professional transcription services.
