Best Fast Transcription Tools for Non-English Videos
Discover Whisper S2T with the CTranslate2 backend for fast, high-quality video transcription. Ideal for non-English streams, tested for speed and quality.
OpenAI Whisper? No, There Are Better Options
Added on 01/29/2025

Speaker 1: The most popular tool for transcribing videos is OpenAI Whisper. So should you use it? No. Faster Whisper is way faster. But WhisperX is even faster. Wait, there is also Insanely Fast Whisper, and Whisper S2T, which claims to be fast too. Yeah, there are a lot of implementations of OpenAI Whisper that are faster, and I've tested them all, because I wanted to generate captions for a stream archive website that I made. To save you time: the best implementation is Whisper S2T with the CTranslate2 backend. There are faster options, such as Whisper S2T with the TensorRT backend. It's twice as fast, but I've noticed that the results are twice as bad: words repeated over and over, wrong punctuation, and many typos.

There is one project that I want to highlight: WhisperS2T-Transcriber. It has just six stars on GitHub, and it gives you a GUI with Whisper S2T pre-configured to use the CTranslate2 backend. I've used this tool to generate all the captions on my site, and it worked flawlessly. Installation is pretty straightforward. You need Python, Git, Git Large File Storage, and the CUDA Toolkit, although you can also run transcription on the CPU instead of a GPU. Once you have all of these tools, create a new Python virtual environment and then execute the scripts. After it's done, you will see the GUI. Here you can add a folder if you want to batch-transcribe several files and set up a couple of parameters. For CPU transcription you should use int8 quantization, and for GPU float32 or float16. I've used float16.

Now we need to choose the model size and the batch size. Both are super important, and you will need to run benchmarks on your own machine to find the best combination. A larger model means better transcription quality, especially for non-English audio. A larger batch size means the file is split into more chunks that are processed in parallel, which speeds up transcription, because GPUs are utilized much better with parallel work than with long sequential work. So what prevents you from going to the max? A bigger model massively increases how processing-intensive transcription is and how much memory it consumes. If you are transcribing on a graphics card with just 4 GB of VRAM, you cannot use large-v2 because it won't fit. With batch size, bigger is always better for speed, but it also increases VRAM usage because more batches are processed at once.

I have an RTX 4070 Super, and with this GPU the best combination is large-v2 with a batch size of 20. With this config I see around 80% VRAM usage, so I'm confident I won't hit out-of-memory crashes during long transcription sessions. With these settings, a 1 hour 44 minute stream in Polish was transcribed in 1 minute 21 seconds (roughly 77 times faster than real time) with very high quality. Yeah, on a single consumer midrange GPU you can transcribe an hour of non-English video with very high quality output in less than a minute. I told you I would save you a lot of time. I was very happy with the output in both Polish and English. There will always be some errors in the output, but Whisper S2T with the CTranslate2 backend gave me the fewest errors while transcribing very fast. If you need a solution for a server rather than a GUI, Whisper S2T is available as a Docker container, so you can easily test it, and there is also example code for Python. So, in conclusion, I hope that I saved you some time. Alternatives such as WhisperX, Faster Whisper, and Insanely Fast Whisper are slower on consumer GPUs and do not improve transcription quality.
I hope that the maintainer of Whisper S2T won't abandon this project, because it is the best open-source implementation of Whisper, but he hasn't been active on GitHub for a long time, so I don't really know. Anyways, that's all for today. Have a nice day.
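For reference, the Python usage mentioned in the transcript looks roughly like the sketch below, assuming Whisper S2T is installed per its README and a CUDA-capable GPU is available. The file name, language code, and batch size are illustrative values taken from the video's Polish benchmark, and the `load_model`/`transcribe_with_vad` calls follow the WhisperS2T README at the time of writing rather than anything shown on screen, so treat this as a starting point, not the project's definitive API.

```python
# Minimal sketch of batched transcription with Whisper S2T and its CTranslate2
# backend, based on the usage pattern in the WhisperS2T README; function names
# and defaults may differ between versions.
import whisper_s2t

# large-v2 gives the best quality for non-English audio but uses the most VRAM;
# it will not fit on a 4 GB card.
model = whisper_s2t.load_model(model_identifier="large-v2", backend="CTranslate2")

files = ["stream.mp4"]      # hypothetical input file
lang_codes = ["pl"]         # Polish, as in the benchmark from the video
tasks = ["transcribe"]
initial_prompts = [None]

# batch_size=20 is the combination that worked well on an RTX 4070 Super
# (~80% VRAM usage); benchmark on your own GPU and lower it if you run out
# of memory. The GUI's int8 (CPU) vs. float16/float32 (GPU) quantization
# choice is a backend option; check the project docs for how to set it here.
out = model.transcribe_with_vad(
    files,
    lang_codes=lang_codes,
    tasks=tasks,
    initial_prompts=initial_prompts,
    batch_size=20,
)

print(out[0][0])  # first utterance of the first file
```

Larger batch sizes raise throughput and VRAM usage together, so the practical approach from the video is to pick the largest value that keeps your GPU comfortably under its memory limit.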
