Implementing Whisper in Opencast: Insights and Challenges
Explore how Whisper was integrated into Opencast, discussing benefits, challenges, and future enhancements with insights from developers.

Speaker 1: Thank you. Well, I think you all know me: I'm Maximiliano Lira Del Canto, from the University of Cologne, and I'm also working on the Educast.nrw project. Today I will give another Whisper talk. This one is purely technical: how Whisper was implemented in our Opencast codebase. I will be very, very fast and skip all the definitions; thanks to my colleagues for explaining them before. So that part is only introductory, you know it already, and now: how Whisper was implemented in Opencast.

This was one of those things that fall from the sky: "here is something you could implement in Opencast", and I said, okay, I will try it. First I tried an implementation with an external service. The Whisper community is very active, and someone had created a REST API server packaged with Docker, so I put it up on our cluster that has a GPU and then wrote a script around the execute-once workflow operation handler. It works like a charm. It's very easy to use, and it's similar to what our colleague presented before. The only disadvantage is that it needs more configuration from the user, in the sense that you have to set up another machine, plus all the things you have to do for execute-once. This project is already in the helper-scripts repository if anyone wants to use it.

And here is how the execute-once workflow operation handler works. Very simple: take the media file, strip out only the audio, send the audio to the server, the server gives me back the stream of text as the API answer, then I save it as a VTT file and delete the audio file I used. Very simple.

The next step was: okay, we can do this inside Opencast. In one of the weekly developer meetings, Lars said it could work like Vosk, because it's very similar. So I said, okay, I will try again, and it's in the latest Opencast right now, ready, so you can use it. How was it developed? It was very difficult to develop, because I wanted to write as little code as possible; I didn't want to add any extra class or interface. So I used OSGi, so I can choose between implementations behind the same interface. It was more difficult than I thought, but thanks to Gregor Eichelberger, who knows a lot about OSGi, I was able to get to this very simple line with which I can have Whisper and Vosk at the same time without adding any code beyond the Whisper engine itself.

Now, once you have Whisper working on your machine, there are some Whisper errors you can run into. For example, if it doesn't run in Opencast, you have to check that the latest version is installed, run Whisper manually, and check the Opencast configuration files regarding speech-to-text: that is, the workflow operation handlers and the configuration files.

This is an example of the workflow operation handler using Whisper. Ah, here it is. It's very simple. I want to say thank you to Harvard, because they made the conditional-config workflow operation handler, and that made this so easy to implement, because you don't know in advance whether the presenter video is available. So first I create a conditional-config that says: if there is a presenter track, use the presenter video; if not, use the presentation video. And then comes the speech-to-text workflow operation handler.
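[The slide itself is not reproduced in the transcript. As a rough sketch of what such a workflow definition can look like: the operation is Opencast's speechtotext operation, but the configuration keys below are quoted from memory of the Opencast documentation and may differ between versions. The ${captionSource} variable is a stand-in for the value a preceding conditional-config operation would set to "presenter" or "presentation".]

```xml
<!-- Sketch only: run the speech-to-text operation on the track chosen
     by a preceding conditional-config operation (not shown).
     Configuration keys may vary across Opencast versions. -->
<operation
    id="speechtotext"
    description="Generate subtitles with the configured engine (Whisper)">
  <configurations>
    <!-- ${captionSource} is set to "presenter" or "presentation" beforehand -->
    <configuration key="source-flavor">${captionSource}/source</configuration>
    <configuration key="target-flavor">captions/source</configuration>
  </configurations>
</operation>
```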
It's the same as the Vosk one, but you have to configure the Opencast files to use Whisper instead (a sketch of that configuration follows below). It's very simple, as you can see, and bam, you have your subtitle file.

Future work. What we have discussed but not yet planned: the integration of Fast Whisper. A few weeks ago another fork of Whisper appeared, called Fast Whisper, which is very, very fast because it uses C instead of Python, so you can get up to around a 30% speed increase. Then custom Whisper commands: today's Opencast implementation uses something very close to Whisper's default configuration, so you can't tune the temperature or other advanced options. And live transcription, but I think that is out of scope for Opencast, because that is more a task for the relay server, like BOSA or whatever else you use for streaming.

There are a lot of ways to integrate this with Opencast; I think you've seen everything. Execute-once, I love that workflow operation handler, and the Java implementation is very simple to use, thanks to the OSGi approach. So try Whisper today. Yeah.
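[For reference, engine selection lives in the speech-to-text service configuration. A minimal sketch, assuming the file paths and property names as recalled from the Opencast documentation; verify them against your Opencast version before use.]

```properties
# etc/org.opencastproject.speechtotext.impl.SpeechToTextServiceImpl.cfg
# select the engine implementation (Vosk is the default)
engine.type=whisper

# etc/org.opencastproject.speechtotext.impl.engine.WhisperEngine.cfg
# path to the whisper binary and the model to load (names from memory)
whisper.root.path=whisper
whisper.model=base
```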

Speaker 2: So, when should one use Whisper via the web service, and when as the workflow operation handler? What's the difference, and what are the pros and cons?

Speaker 1: The main advantage of using an API server is that you can have a computer outside of the Opencast cluster. For example, you have contracted some specialty GPUs elsewhere, so you can call that server. With the workflow operation handler, everything is managed by the worker itself: if the worker has a GPU, it will use the GPU, and if you configure the job loads for the speech-to-text workflow operation handler, you can make it a specialised worker. That is the biggest difference. Both approaches have their advantages, but the workflow operation handler is simpler to use, because Opencast takes care of the files and everything.
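[To make the web-service variant concrete, here is a minimal sketch of the execute-once helper flow described earlier: strip the audio, POST it to an external Whisper server, save the returned captions as VTT, and clean up. The endpoint, field, and parameter names are assumptions modeled on a community whisper-asr-webservice Docker image, not the actual helper script.]

```python
#!/usr/bin/env python3
"""Sketch of an execute-once style helper: media -> audio -> external
Whisper REST server -> VTT file. Endpoint and parameter names are
assumptions; adapt them to the server you actually run."""
import os
import subprocess
import sys

import requests  # pip install requests

ASR_URL = "http://gpu-node.example.org:9000/asr"  # hypothetical server


def transcribe(media_file: str, vtt_file: str) -> None:
    audio = "audio-only.wav"
    # 1. Strip the audio track; 16 kHz mono is sufficient for Whisper.
    subprocess.run(
        ["ffmpeg", "-y", "-i", media_file, "-vn", "-ac", "1", "-ar", "16000", audio],
        check=True,
    )
    try:
        # 2. Send the audio to the Whisper server and ask for WebVTT output.
        with open(audio, "rb") as f:
            resp = requests.post(
                ASR_URL,
                params={"task": "transcribe", "output": "vtt"},
                files={"audio_file": f},
                timeout=3600,
            )
        resp.raise_for_status()
        # 3. Save the returned caption stream as a VTT file.
        with open(vtt_file, "w", encoding="utf-8") as out:
            out.write(resp.text)
    finally:
        # 4. Delete the temporary audio file.
        os.remove(audio)


if __name__ == "__main__":
    transcribe(sys.argv[1], sys.argv[2])
```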

Speaker 3: You talked earlier about the different implementations of Whisper, the GPU-based and the CPU-based, and you also just said that there's Fast Whisper. With which of these implementations does the workflow operation handler work right now?

Speaker 1: Right now it works only with the official one, and I have to give a warning: Whisper just had an upgrade last week, and if you upgrade Whisper, the workflow operation handler will not work. There is a pull request ready for review, so please review and merge it, so that it works with the latest version.

Speaker 3: And the official one is the GPU-based one, right?

Speaker 1: GPU and CPU; you can use both. The CPU is very slow, that is the only problem. For example, on my Intel Mac with an Intel i5, a different machine from the one I use in Cologne, one hour of material took about 45 minutes, but with the GPU it was done in less than two minutes. And I have to say that was not a very top-of-the-line card; it was a very basic NVIDIA card, a Quadro, but from the basic Quadro line.
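[The GPU/CPU gap described here comes from PyTorch device selection: Whisper uses the GPU automatically when PyTorch sees CUDA, and otherwise falls back to the slow CPU path. A small sanity-check sketch for a worker node, assuming the official openai-whisper Python package (and its PyTorch dependency) is installed:]

```python
import torch
import whisper

# Whisper runs on the GPU automatically when PyTorch detects CUDA.
print("CUDA available:", torch.cuda.is_available())

# Model sizes (tiny/base/small/medium/large) trade accuracy for speed;
# weights are downloaded on first use.
model = whisper.load_model("tiny")
result = model.transcribe("lecture.mp4", language="en")
print(result["text"][:200])
```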

Speaker 4: Because I tried Whisper on our CPU nodes without doing anything special: just installing the Whisper CLI and not configuring anything. Our CPU nodes have 32 and 64 cores and an enormous amount of RAM. It started the Whisper operation at 7 o'clock in the evening, and at 8 o'clock the next day it was still running, for a normal lecture of one hour and 30 minutes. I had to stop it.

Speaker 1: What model are you using?

Speaker 4: I saw that it was really slow, so I switched from medium to small, and it kept running.

Speaker 1: With the small model? Wow. Yeah, my test on the Mac was with the tiny model.
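[For manual testing of the kind discussed above, the official CLI makes it easy to compare model sizes before committing a cluster to one; the flags below are provided by the openai-whisper package.]

```sh
# Transcribe a lecture with a small model; on CPU-only nodes the model
# size dominates the runtime (tiny/base are far faster than medium).
whisper lecture.mp4 --model tiny --language en --output_format vtt --output_dir captions/
```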
