Innovative Real-Time Captioning with Open Source Tools
Exploring Vosk for live captions using FreeSWITCH. A viable, privacy-friendly alternative to Google's transcription services.
bbbdev16 Live Transcription with Vosk
Added on 01/29/2025

Speaker 1: Okay, I really wanted to present this. This is the demo. In 2.6, our team from Mconf contributed live captions, which use Google Web Speech, a really simple API from Google to integrate: every browser sends its audio to Google. This works only on Chrome. The audio goes to Google's servers, they use their AI magic to transcribe it and send it back, and then BigBlueButton has the transcription for each user and can show live captions. This is all good, but it uses Google, which is a proprietary system, and that's really painful for people around Europe, and actually in Brazil with our new legislation, because of data privacy. So the idea here is: let's use one of the newer open-source AI systems, so that we can host them ourselves. And that's what we did. We didn't do it before because this technology is really new. Me and Arthur, who is kind of shy, he's somewhere here, but he really helped a lot, took something called Vosk, which is a transcription server that actually uses Kaldi in the background. What we're doing technically is really simple. We intercept the call on FreeSWITCH, which is the audio server that BigBlueButton uses, using something called mod_audio_fork, and we send the audio directly to Vosk. Vosk sends us back JSON messages with transcription information. It's not perfect, as I guess you're seeing, especially when you're saying lots of made-up nonsense words that programmers use. Oh, no, it stopped. Okay, it's back. This happens with Google as well, so be kind with me. But this is running. Yeah.
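The handshake with a Vosk WebSocket server is small: the client first sends a JSON config with the sample rate, then raw PCM audio chunks as binary frames, then an end-of-stream marker, and the server replies with JSON results. A minimal sketch of the frame sequence the FreeSWITCH side would send (the 16 kHz rate and 4000-byte chunk size are illustrative assumptions, not values from the talk):

```python
import json

def vosk_frames(pcm: bytes, sample_rate: int = 16000, chunk_size: int = 4000):
    """Yield the frames a client sends to a Vosk WebSocket server:
    a JSON config first, then raw PCM chunks, then an EOF marker."""
    yield json.dumps({"config": {"sample_rate": sample_rate}})
    for i in range(0, len(pcm), chunk_size):
        yield pcm[i:i + chunk_size]       # 16-bit mono PCM, sent as a binary frame
    yield json.dumps({"eof": 1})          # asks the server to flush a final result

# Example: one second of silence at 16 kHz, 16-bit mono
frames = list(vosk_frames(b"\x00\x00" * 16000))
```

In the real pipeline these frames travel over the WebSocket that mod_audio_fork opens toward the Vosk server; after each audio chunk the server pushes back the JSON transcription messages described below.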
The idea is that this is running Vosk, transcribing my audio in real time. This would work for all the users in the session; we can get their audio streams independently, so that's pretty cool. And here on the side we have... oh, I'm not sure why it stopped, but the idea is that the way most transcription systems work is that they give you partial results, which, as you see, can change over time, because the AI may become more sure of a better transcription the more you speak. We display those partial results. Then, when it's reasonably sure you finished the sentence, maybe because you stopped talking for a while, or maybe because it just thinks it should end there, it gives you a final result. This is an Etherpad, which is actually an old system we had for transcriptions, where a transcriber would be typing them manually. What we built is kind of a mashup of both systems: when we have the final result, it just gets appended to the Etherpad. It's a live demo, so something went wrong here, but until a few moments ago it was sending the messages there. And the idea is that after the messages have been added, someone could edit them and fix bad transcriptions and so on. But I think that's it. Going forward, we're making the components that glue this together configurable, so we could support other transcription servers. The other candidate is OpenAI Whisper, which, if we get lucky, we can get working before the end of the afternoon, because it's mostly a matter of doing the handshake, expecting and sending the data in the correct format. It shouldn't be too hard. But that's the demo. That's it.
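The partial/final distinction described above maps directly onto the JSON Vosk sends back: interim messages carry a `partial` field whose text keeps changing, and once the recognizer decides the utterance is over it emits a message with a `text` field. A small sketch of the glue logic, with a plain list standing in for the Etherpad (the class and its names are illustrative, not the actual component from the talk):

```python
import json

class CaptionGlue:
    """Show partial hypotheses live; append only final results,
    as the demo's Etherpad integration does."""
    def __init__(self):
        self.partial = ""    # current in-flight hypothesis, displayed live
        self.pad = []        # finalized lines, appended once and editable later

    def handle(self, message: str):
        result = json.loads(message)
        if "partial" in result:          # interim result: may still change
            self.partial = result["partial"]
        elif "text" in result:           # final result: recognizer committed
            if result["text"]:
                self.pad.append(result["text"])
            self.partial = ""

glue = CaptionGlue()
for msg in [
    '{"partial": "hello"}',
    '{"partial": "hello every"}',
    '{"text": "hello everyone"}',
    '{"partial": "this is"}',
    '{"text": "this is the demo"}',
]:
    glue.handle(msg)
# glue.pad now holds the two finalized utterances
```

Because only `text` messages reach the pad, a human editor can later correct a committed line without the live partial display fighting their changes.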
