Speaker 1: Thank you very much for listening to my speech. It is a little difficult to speak in Japanese all the time, so I will continue in German from now on. So I will continue in German. I think a few more people here in the room understand that. Maybe not all of you, but maybe it is a little better in English.

So, we'll start by introducing a bit of this AI that I was using, Whisper AI. Then I will tell you how I used it to generate the subtitles that you just saw. And lastly, we'll talk a little bit about how to use asynchronous programming in Python, whether to use threading or multiprocessing, and what the differences are.

The AI I was using is called Whisper. It is by OpenAI. It was released last September, which is between the releases of Stable Diffusion and ChatGPT, which is, I think, a reason why this AI has been a bit overlooked since then. I think it is very useful, and maybe you saw why just now. It is actually completely open source, unlike ChatGPT. You can see the code on GitHub, and the weights are on Hugging Face, so you can use it on any computer you have that can run it.

Let's continue with this overwhelming slide. Don't worry, we won't go through everything, but this is basically all the detail there is on Whisper; it's also from the OpenAI paper. Let's just start with the training of the AI: what they did, and how they achieved what you just saw. They trained on mostly three different kinds of data. They trained on English audio with English subtitles. They trained on non-English audio with non-English subtitles, and also some on, for example, German audio with English subtitles, so translation was also in the training data. They used a lot of different kinds of audio, I think mostly from YouTube, but probably other sources too, which means that this model is very, very robust, and it doesn't really need any fine-tuning. What I just showed was the pure model that you can download, and maybe you noticed that I didn't even have to tell it what language I was speaking. I didn't switch anything, I just spoke in different languages, and it realized what was going on and translated all of it to English, because that's what I told it to do. I could have also told it to write down what I'm saying in the same language, but that's not quite as useful, maybe.

Back to this overview. On the top right, we have the architecture of the model, which we're going to look at next, and if you have been following AI news in the last year, you might recognize this shape. It might look familiar. It is actually pretty much the exact same architecture as a transformer, so the same as ChatGPT; the T in GPT stands for transformer. I'll just go through it in a little bit of detail, I'm not a super expert on this, but ChatGPT has inputs and outputs, and both are text. You put some text in, and it gives you the next piece of text. The architecture has something called attention, which means that to predict the next token, it looks at all of the tokens that came before, or at least the context window of tokens that it has, but not all at the same level. The attention part means that some tokens relate to other tokens more strongly than others. This is very simple mathematics, actually, and an ingenious concept, and it works really well. It just means that, for example, if you're talking about...
Well, if I start speaking Japanese, and it hears a keyword that is obviously Japanese, that will spread to all the other words and say, hey, we are talking Japanese, for example. That's kind of how it works. And Whisper doesn't have text on the input side, but sound waves; the output is exactly the same, it's also tokens of text. So Whisper learned to listen to sound, and it also learned the correct text output. It tries to predict the next text token, just like ChatGPT, only the input is a little bit different. So now, when I'm speaking, it listens to the sound and tries to predict the next text token, and it's basically the same thing as a text transformer. This is quite a nice parallel. The last part I'm not really going to explain; it is just about how the tokens work in Whisper exactly, which is not extremely interesting. Instead, I'll show you another example of what I mean by the robustness of the model. I'm going to play this clip, it's the beginning of a song, and I want you to try to understand the lyrics.
Speaker 2: Who understood everything?
Speaker 1: Two people. Maybe you are from Britain? Maybe you understand sign language? Or maybe you are Ed Sheeran super fans? I don't know. But two people out of maybe 50. Of course, I also don't really understand everything. One of the first things I did was put this song into Whisper. So just the base Whisper that you can download: it takes sound, basically an MP3, for example, processes it for a while, and then gives you all the text that's in this MP3. So let's see what Whisper heard. You can check the text after these brackets. I'm going to play it again.
Speaker 2: ♪♪♪
Speaker 1: Not too bad, I think. Yeah. Well done, OpenAI. And actually this text was not generated by the real Whisper, the largest model; it's one step down. They have a few different model sizes, and I used the medium one, which runs on my own computer at home. I have a gaming computer with a graphics card from three years ago, and really the one thing that's limiting is the graphics memory: I have 8 gigabytes, and the largest Whisper needs 11. So this is not even the full model, not even its final form.

Yeah, another example of how good Whisper is. Many people tell me, ah, Microsoft Teams has the same feature, right? Yes, it does have the same feature. You can speak any language, and it will give you an English subtitle. So here's a little experiment: I turned it on while we had a meeting at work. It says: "Yes, so it may be that she does not jump in that at all. The problem was, and then the relaxed, it was really because of this." Okay. Very helpful. And at the same time, I was running my own program, again on my medium-sized graphics card, which is roughly the same tool that I demoed at the beginning. And it says: "So it could be that the sprint wasn't the problem. And then it was really because of these block things." So, I mean, if you know we're in a retro, we're doing Scrum and everything, that makes sense. So that is one difference. But the biggest difference here is that the thing on the left is running in a cloud, right? There are some more implications, but I'll get into them later. The thing on the right is running locally on my machine. So it's really, really portable, and you can do a lot with it. That's what I find really fascinating, and why I want to introduce Whisper to the world a bit more.

So let's talk about Whisper AI for subtitles: basically how I took Whisper, the program that just takes an MP3 and gives out text, and made the tool that you saw at the beginning, which transcribes and translates live while I'm speaking, which is not really the same thing. First of all, why did I do this? What was the challenge or the idea? I used to work at a Munich-based company. This company was founded over 20 years ago and used to have only German employees. Now they have expanded: they have hired people all over Germany and also all over the EU. Some are even working in Australia now. So 90% of the employees are still German and maybe don't speak perfect English. They usually do speak very good English, but maybe not perfect, and maybe it's nicer for them to speak and hear something in German. And 5% don't understand German at all. So this is kind of a conflict. There was a weekly meeting that gives updates about the company, and it used to be only in German, obviously, and it continued to be in German for a while, even while there were non-German speakers in the company. The solution was to have a distilled version of that meeting just afterwards in English for the English-speaking colleagues, which is a workable solution. But what I really wanted is that everybody in the company can come together again in this big meeting, where we have important announcements, learn about the new hires, and can also celebrate things together, all in one meeting, no matter which language you understand. Of course, another solution would be to speak English, but that has its downsides as well in a previously German-only company. So that is basically why I got the idea to develop this tool.
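For reference, the plain MP3-in, text-out usage mentioned above is only a few lines with the open-source whisper package. A minimal sketch, where the model size and the file name are just example values:

```python
import whisper

# Load one of the pretrained sizes (tiny/base/small/medium/large);
# "medium" is the size that fits into an 8 GB gaming GPU.
model = whisper.load_model("medium")

# Transcribe a whole audio file. task="translate" asks for English output
# regardless of the spoken language, which Whisper detects on its own.
result = model.transcribe("song.mp3", task="translate")

print(result["language"])           # detected source language
print(result["text"])               # full transcript/translation
for segment in result["segments"]:  # phrase-level pieces with timestamps
    print(f'{segment["start"]:6.1f}s  {segment["end"]:6.1f}s  {segment["text"]}')
```

The live tool described next is essentially this call run over and over on a growing audio buffer.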
And the version that I built at the company was for this meeting in Zoom: there are 500 people in the same Zoom meeting, there's a big presentation from an office, and it's streamed to all the participants. In this meeting, a lot of very sensitive topics are discussed. We trusted Zoom, and it also has end-to-end encryption, but we did not want to send all the data to any other server. So this is where Whisper comes in again. We have a big, very expensive computer in the office, which has a very expensive graphics card. I just put Zoom on that PC, it joined the meeting, and on that PC Whisper was also running. So Whisper was basically also in the meeting, listening in to the German and generating the subtitles live, as you saw just now. The subtitles are combined with the video from the meeting in a little tool called OBS Studio, which is also very useful for streaming, for example. In OBS Studio, I define something called a virtual camera, and I stream the combination of the meeting and the live-generated subtitles back into the meeting, so that if people want subtitles, they can just watch the user which is logged in on the AI computer. So basically, Whisper is in the meeting as well, listening in and typing really fast, and people can just look at its camera to get the subtitles. This worked reasonably well.

Since we are at EuroPython here, I'm going to zoom in on the Python part a bit and explain a little more about the challenges with live transcription and translation. So the basic... it's a bit small. The basic structure of the program is that Python records a little bit of audio, for example one second, and appends it to build up a longer and longer piece of audio. Then it gives that to Whisper, because Whisper can only transcribe pieces, like files of audio; it can't really do live yet. So I'm kind of making it transcribe many, many small things really fast. It does its thing, it gives me the subtitles, and I show them. So, no problem. But if you just implement it like this, without any thinking about asynchronicity, you realize that while Whisper is working, and the AI needs one or two seconds to run, the audio code is not running. So I'm always losing as much sound as Whisper is taking time to translate. Of course, that's not good. So parallelism has to come in. What I really want is one thread, or one piece of code, to run in a loop and record one second of audio every second, really without a gap. And I want another piece of code to use Whisper: send it the audio that we currently have and get the results back, and any time it's done, it immediately gets a new audio file to transcribe. And then on top, of course, I want to show the subtitles, and I want them available not just while the AI is not thinking, but all the time. So, three kind of separate parts of the code.

What I used is threading, the threading library from Python. I basically just started a thread for the listening code, which does this recording loop, and I started a thread for the transcribing. And that's it. The rendering is a trick, because it's actually shown via a browser: I just use setTimeout in JavaScript and ask every 500 milliseconds for the new text. Quite easy. So this slide is the details of what I just said. The audio uses a temporary file, a wave file, so the synchronization between these two loops is actually done via the hard drive.
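A stripped-down sketch of that two-thread structure. The threading and queue modules are the real standard-library pieces; record_one_second, append_to_wav, and the preloaded model are hypothetical stand-ins for the actual recording and Whisper code:

```python
import queue
import threading

import whisper

model = whisper.load_model("medium")   # loaded once, shared by the transcriber thread
new_audio = queue.Queue()              # listener tells the transcriber "there is new sound"
subtitles = []                         # latest text, polled by the page every 500 ms

def listen_loop():
    # Record one second at a time, with no gap, and append it to the shared wave file.
    while True:
        chunk = record_one_second()            # hypothetical: microphone -> raw samples
        append_to_wav("buffer.wav", chunk)     # hypothetical: grow the temporary file
        new_audio.put("buffer.wav")            # wake the transcriber

def transcribe_loop():
    # Whenever new audio is announced, run Whisper on everything recorded so far.
    while True:
        path = new_audio.get()                 # wait for at least one new chunk...
        while not new_audio.empty():           # ...then drain the backlog so we always
            path = new_audio.get_nowait()      # work on the most recent state of the file
        result = model.transcribe(path, task="translate")
        subtitles.append(result["text"])       # what the browser picks up on its next poll

threading.Thread(target=listen_loop, daemon=True).start()
threading.Thread(target=transcribe_loop, daemon=True).start()
# The main thread then serves the page that polls `subtitles`.
```

The queue here is really just a wake-up signal; the audio itself travels through the temporary wave file on disk, which is exactly that hard-drive synchronization.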
Operating systems are quite good at having multiple processes read the same file, let's say. And a wave file is really nice because you can append to the end, read from the beginning, and delete from the beginning; it's basically an array of audio on the hard drive. So I put one second at the end of the file every second, and whenever Whisper was done, I told the Whisper thread: hey, there's a new piece of audio, go. I did this via a queue: I just added the audio file to the queue, and the Whisper thread reacts to this queue every time something is added, which is basically every time it's done. It runs, takes those two or three seconds, and outputs some text, hopefully.

And then, as you saw at the beginning, sometimes the text is gray, because it's not really clear yet if that's really what was said, and sometimes it's black, because I say: okay, I'm sure now, this is committed. This is done for two reasons. First, I don't want this wave file to grow and grow and grow, because then the AI would have to think about more and more sound; it would get slower and slower and just not work in the end. And second, to display subtitles I need to be able to say: okay, this subtitle is done now, the next one is coming. So whenever it finds a whole sentence, and Whisper very nicely gives us these things, right, remember with the song the output came in four phrases, so it says there's a phrase here, a phrase here, a phrase here. Whenever Whisper tells me a phrase has finished and a new phrase has started, I take the text and commit it. Whisper also tells me where this phrase was in the sound file, so I can just delete that time range from the beginning of the sound file. So the file never really grows much: it grows a little, then the committed sentence gets removed from the beginning, then it grows again, and so on. I'm not sure where I was going with this. But anyway, when the text is committed, it goes into another queue to the front end, basically, to add another line of text to the result array, so to say. And then the front end, like I said, polls every half second and gets the new text.

There is an open source version of this. It's not exactly what I ran at the company, but it has the backend part: the parallel recording and transcribing. So those are the details of the program. Don't worry, there will be a link to the slides later.
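The commit step, in sketch form. The segment list with start/end/text is what model.transcribe really returns; trim_wav_start is a hypothetical name for the file-trimming helper, and result is the dictionary from one transcription round:

```python
import queue

committed_lines = queue.Queue()   # black, final text handed to the front end
pending_line = ""                 # gray, provisional text that may still change

def handle_result(result):
    """One transcription round: commit finished phrases, keep the last one gray."""
    global pending_line
    segments = result["segments"]         # each segment has "start", "end", "text"
    if len(segments) > 1:
        # Every phrase except the last, still-open one is treated as final.
        finished, open_phrase = segments[:-1], segments[-1]
        for seg in finished:
            committed_lines.put(seg["text"])
        # Cut the committed audio off the front of the buffer so it never grows much.
        trim_wav_start("buffer.wav", seconds=finished[-1]["end"])  # hypothetical helper
        pending_line = open_phrase["text"]
    elif segments:
        pending_line = segments[0]["text"]
```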
Now just some words about the Python details of threading and multiprocessing, to compare the two, because they are two different things: the threading library and the multiprocessing library, and a third option later. Threads have shared memory. So I started these two threads, but I can actually share objects between them, and I don't really have to think much about the allocation of resources; Python does everything for me. They communicate: you can send data between threads with a queue. Both threads are running asynchronously, and if I give the same queue reference to both threads, they can send each other messages through it. This leads to the asynchronous code you saw. However, threads actually cannot run in parallel. It might be surprising, but let me explain. Threads are limited by the GIL, the global interpreter lock, which means that in Python only one piece of Python code can run at the same time. However, when we are running the AI, we're calling into native code; I'm not sure of all the details, but it's not Python that is actually executing the model. That is why, at that moment, the lock is released and the rest of the code can run again. So it is asynchronous, but not exactly parallel, and it works just well enough for my use case.

And then the more complicated and stronger version would be processes, so multiprocessing. When you use multiprocessing in Python, you actually start whole Python instances, so you have two Pythons running. They are completely separate, which makes the code quite a lot more difficult to write, because you have to think about whether this code has been spawned in the child process or whether it is the main code, and so on. They also communicate via a queue, but these queues are not the same queues, that's why you can see the different colors on the slide. The kind of queue that you need for multiprocessing actually pickles the data, so I think it has to be serializable or something. I'm not exactly sure what pickling means, because I'm not actually a Python expert; sorry, don't be too disappointed. But it is much stricter about what data you can send through the queue; the threading queue is very easy, you can send references and things like that. There is also something called a pipe, which just connects two processes. A queue can be read by multiple consumers and producers, so a queue is really general, but that also makes the code a bit harder; pipes are just one way in, one way out, and they are also very strict. And the main positive part of using processes, of course, is that it actually uses all your cores if you want: if you spawn eight processes, you can use all eight of your CPU cores. But note: in my example, with recording audio and transcribing with Whisper, I don't care about the CPU. I mean, the audio recording maybe uses the CPU, but Whisper does not; Whisper runs on the GPU. So I don't need multiprocessing. I could use it, and I actually did, the latest version I have is with multiprocessing, but the speed is exactly the same; it's just a little bit fancier and takes longer to stop.

But since that is the case, since I don't actually need real multiprocessing in Python, in the very newest version I used async await, which isn't parallel at all. It is a single event loop, single threaded; you just have to write your code in a way that it never blocks itself. And it's fine, because you can wait for the AI while your Python code keeps running. It makes coding easier, because you don't really need a queue; you can just use in-memory variables if you're lazy like me. And if you just use one file, it's super easy to read, as long as it's not too much code. It just makes everything easy enough. It also works well with JavaScript and with being a server, so what I showed at the beginning is actually an API for the AI.
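The async/await shape, as a minimal sketch (Python 3.9+ for asyncio.to_thread). asyncio is real; record_one_second, append_to_wav, and the loaded model are the same hypothetical stand-ins as before. The blocking Whisper call is pushed off the event loop so the recorder keeps running:

```python
import asyncio

import whisper

model = whisper.load_model("medium")
subtitles: list[str] = []        # plain in-memory state, no queue needed

async def listen():
    # Append one second of audio at a time; awaiting lets other tasks run in between.
    while True:
        chunk = await asyncio.to_thread(record_one_second)    # hypothetical helper
        append_to_wav("buffer.wav", chunk)                    # hypothetical helper

async def transcribe():
    # The model call blocks for a second or two, so it is handed to a worker thread
    # and awaited; the event loop (and the listener) keeps running in the meantime.
    while True:
        result = await asyncio.to_thread(model.transcribe, "buffer.wav", task="translate")
        subtitles.append(result["text"])

async def main():
    await asyncio.gather(listen(), transcribe())

asyncio.run(main())
```

In the real tool this sits behind a small web server, and the page showing the subtitles simply polls the current state every half second, just like in the threaded version.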
And that leads me to my next topic. What I presented now was running Whisper AI on your own computer, which is very nice. It can be very fast, if you just buy a nice GPU. The main good thing about it is probably the privacy and data safety: your data does not leave your room; you can just do it at home. But the downside, of course, is that it's quite a heavy thing, you know, the computer. I don't have it here; maybe you noticed I didn't bring my gaming PC with me. So instead, the new version I wrote, which is on my GitHub, linked at the end, is actually a server version of this. What I showed you at the beginning is running right now in the cloud, and not on AWS or Google or whatever. I use Lambda Labs, which is quite a nice company; there you can rent really nice, really fast GPUs for not too much money. On AWS, if you use the cheapest GPU, it works, it has enough memory, but it is so slow that it is basically unusable. The same tool on Lambda Labs, where I use the A10, kind of the weakest one they always have available, runs quite well. You could also use the H100, the best GPU in the world; you can just get it with a click for two euros per hour, which is quite nice. And this is the current version of my code, which runs quite well in the cloud. So for the future, my idea, or kind of my dream, is to have one of these devices, or maybe something not quite so big, so that I can go to any country where I just have internet access, put it on, and understand what everybody around me is saying. If you have one of these things, talk to me. I want to try it out. And that's it. Thank you.
Speaker 3: Thank you very much for this very inspiring talk. We do have time for some questions. If somebody has a question about this, then please step to the microphone.
Speaker 4: Hi. Loved your speech. You tested out a few major languages, but what if you would like to try some more niche languages, not like Japanese or Chinese, but, say, Lithuanian? Or something, I don't know, from a small country. Would it work?
Speaker 3: Who in the room is from a very small country? Can we have a hand sign, please?
Speaker 1: I tried Lithuanian. It works okay. Lithuanian is not great because there's not much training data. I have a Lithuanian friend; they were not happy. But if someone wants to ask the next question in an interesting language, I would be happy to.
Speaker 3: Can you please, for this question, come here so that your microphone will work?
Speaker 5: I think this will do. No? Try. Attempt number one, microphone test. Will it work or not? More? So, I don't know. Welcome to the EuroPython 2023 conference. This lecture is pretty good. Excellent. I'm enjoying it. Thank you.
Speaker 1: So, I think it can't hear you. I'm not sure what he said. Is that half right, at least? No, it can't hear you. The microphone is over here. So, it can only hear me. Yes, please. Go ahead.
Speaker 3: Okay, let's give this another try.
Speaker 5: Is it this one? So, one more time. Attempt number two, microphone test. First, it didn't work very well, so maybe the second one will be better. No. Well, one more time.
Speaker 3: Yeah, one more time. We have time, so let's try this once more.
Speaker 5: Yeah, in the end, it got something. Hopefully, it will be better after the second one. That's what I think.
Speaker 1: Of course, silence also confuses it a bit, and when you switch languages around, it's not that easy, but the longer you talk, the easier it gets. And it did get a few of the sentences that he said, I think.
Speaker 3: So, another question, please.
Speaker 6: Yes, I also took a look at this Whisper AI, and there are these smaller models. Are they able to run on a normal laptop, or is that simply impossible because a laptop is too small for it?
Speaker 1: Whisper can run on the CPU. If you have a lot of RAM, you can use the big models, but it is really slow. It actually needs a fast GPU as well as some memory, and it needs the speed a little more than the space. If you can wait, then it's okay: with enough RAM you can just use the CPU, it's fine. But I recommend any GPU, especially gaming GPUs, because they are fast; the professional ones are sometimes not super fast.
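As a rough sketch of that choice, assuming the open-source whisper package and PyTorch (the model sizes here are just example values):

```python
import torch
import whisper

# Pick the GPU if one is available; otherwise fall back to the CPU,
# where a smaller model keeps the waiting time tolerable.
device = "cuda" if torch.cuda.is_available() else "cpu"
size = "medium" if device == "cuda" else "base"
model = whisper.load_model(size, device=device)
```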
Speaker 3: The next question, please.
Speaker 7: Yes, you mentioned that the AI outputs these phrases. Is it also possible to classify which person is speaking each phrase?
Speaker 1: The AI cannot do that by itself. You would have to do some kind of voice recognition; I'm not sure. It's a different problem. Thank you.
Speaker 8: It's quite interesting, though.
Speaker 3: Okay, the next question, please.
Speaker 8: Hello. So, there are some alternative implementations of Whisper. There is whisper.cpp, faster-whisper, WhisperX. Did you try those? The example at the beginning is not Whisper. It's Whisper JAX.
Speaker 1: Oh, it's yet another one. I didn't tell you, because I don't really know what JAX means. But it is a faster implementation, and it needs a little bit more installation at the beginning, but it works basically the same. And it's really fast. Thanks, I will try those.
Speaker 8: Yeah, go ahead.
Speaker 1: Whisper JAX. J-A-X. That's good.
Speaker 9: All right. So, you were threading the audio, then joining it and sending it to Whisper. Did you have any trouble doing that? Was it straightforward, or were there any issues?
Speaker 1: Yeah, it's quite annoying sometimes, handling audio. In this last version, I'm handling audio from the browser, and the browser sends audio in different formats, and sometimes it compresses it. There are some fiddly bits. But overall, especially when you're not using the browser and sending the sound over the internet, but just using the microphone input in Python directly, it wasn't that bad.
Speaker 9: And regarding noise, did you also manage to handle noise or ambient sound?
Speaker 1: No, I mean, I didn't do that. I only used Whisper. I didn't really massage the sounds. Whisper is very good with some noise. But what's interesting is when there's applause, it always says, thank you. Because, of course, that's what people say.
Speaker 3: Okay, thank you. Actually, that is a good last word to thank you again for your talk because we are out of time. Let's have another round of applause for Mathias.
Speaker 1: Thank you.
Speaker 3: And we will take a five minute break.