Exploring MacWhisper: A Deep Dive into a Mac App Built on OpenAI's Whisper
Join Christian Schiller as he explores MacWhisper, a Mac-native speech-to-text app built on OpenAI's Whisper. Learn about its features, comparisons, and practical applications.
Mac native audio transcription using OpenAI with MacWhisper
Added on 09/06/2024

Speaker 1: ♪ Welcome to another learning live stream. With me, Christian Schiller. I have the headphones on today because I'm gonna be dealing with some audio again, which is always a bit messy, and I am dealing with that... let's boost the volume a bit. I seem to be a bit quiet. There we go. That's a bit better. I'm Christian Schiller. Welcome to a learning live stream where two or three times a week I look at a piece of technology I'm interested in experimenting with. Mondays and Wednesdays are generally something a little bit more technical, and that is sort of what today is. Sort of. We'll come to that in a moment. I am dealing with the wonderful problem that is happening at this time of the year where there is bright sunshine, but I still have lights. I'm trying to sort of mellow the ambient light, but anyway, we'll get there. Okay, so if you like what you're seeing, you can find more about me, and I still seem to be having this extremely slow fade. I don't know why I keep resetting it. It keeps coming back, but anyway, you can find more about me at ChristianSchiller.com, and if you're watching, you're gonna be watching live on YouTube or later, whether you're live or later, you can subscribe, you can say hello, you can leave a comment, you can click on some ads, if you really want to, and more details about other ways you could support me later, most of which you'll find at ChristianSchiller.com. So what am I looking at today? Well, today I am going to be looking at Whisper from OpenAI. What is Whisper? It is a, well, I can actually open that up now so you can have a look, but it's a speech. I already did have it open, so I have to move it into a different window. It's a sort of speech recognition to text framework from OpenAI, but to use it generally is fairly complicated, so people have made various different versions of it to make it easier to use. Let's actually just jump straight into a screen share right now, I think. Get one more window open. 
So this is the Whisper repository on the OpenAI organization. You can see robust speech recognition via large-scale weak supervision. It's been going for a little while, actually. And you can see how it all works here. Quite complicated. I have seen people using it on command line to great effect, and it's using Python, which means it works, but it's relatively slow, in quote marks, and you can see the ways you can install it here. You need FFmpeg, which you probably have installed without even realizing, but you'll still need to install it again because that's the way it is, and you can get the model and build it and do all sorts of things. There we go. I'm assuming this is the speed. The smaller, the better. So Spanish, Italian, English, and then we go all the way up to Nepali, Belarusian, Maori. Some quite interesting languages there. And then command line, this is how I saw people using it before, and great. Fine. So one other thing you can have is Whisper C++, which ports it from Python to C++. It's being kept up to date, and it runs on Apple Silicon, which is great, because that means it will be much faster, and a whole bunch of other operating systems, and I am not quite sure what this is demonstrating, but anyway. You can get it to work offline as well. I am not sure if, yeah, you still have to make and build it, and etc, etc, which is, you know, if you're experimenting with this kind of thing, then it's probably not too hard to do, but it's a barrier to entry, for sure. You have to kind of know what you're doing, and you can see it here. But now we come to a whole other option, Mac Whisper, a Mac native, nice GUI-ified version of it. And what I'm interested to see is what it's using. I did, I can't remember if I registered for Pro or not. We'll find out soon. I don't mind. I get the feeling it might be useful, so I don't mind doing that, but what I would like to know, we can see some of the differences between the versions here. 
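For readers who want to try the command-line route described in this segment, here is a minimal sketch of how such an invocation could be assembled from Python. It assumes the `whisper` command from the openai-whisper package is installed and that FFmpeg is on the PATH. The helper names `build_whisper_command` and `ffmpeg_available`, and the file name `interview.m4a`, are made up for illustration and do not appear in the stream.

```python
import shutil


def ffmpeg_available():
    """Return True if an ffmpeg executable can be found on the PATH."""
    return shutil.which("ffmpeg") is not None


def build_whisper_command(audio_path, model="small", language=None, output_format="srt"):
    """Build the argument list for the openai-whisper command-line tool.

    Hypothetical helper: it only assembles the command, it does not run it.
    """
    cmd = ["whisper", audio_path, "--model", model, "--output_format", output_format]
    if language is not None:
        # Without --language, Whisper tries to detect the spoken language itself.
        cmd += ["--language", language]
    return cmd


if __name__ == "__main__":
    if not ffmpeg_available():
        print("ffmpeg was not found on the PATH; Whisper needs it to decode audio.")
    # "interview.m4a" is a placeholder file name.
    print(" ".join(build_whisper_command("interview.m4a")))
```

The resulting list can be handed to `subprocess.run` once the package and a model are in place. Keeping the command builder separate from the execution makes it easy to check the flags before committing to a long transcription run.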
Some of the differences between the versions here. Ah, I might try that. What I was interested in knowing is what version it's using. I am not, ah, here we go. So, made by building on top of all the hard work for Whisper C++, but we don't know what version it's using. I mean, I would assume it's updated fairly regularly, but there we go. I don't know why, I don't know what's going on with that web page. Sorry, it keeps bouncing around. Anyway, okay. So, I do have it installed already. I don't know if I have the Pro version or not. We'll soon find out. So, I should try and make fair comparisons. I think I have it updated already. Ah, so here we can see I am not using Pro. I do not have Pro yet. So, let's go for just the best I can get for now. I'm sort of going through the payment. Let's just go for small. Still 500 megabytes. This is where it gets interesting. This is for offline access, of course. Quite nice experience so far. Whilst that's doing that, I'm going to, I would like to have some points of comparison. So, here's an interview I did with the infamous Lars Klint last November. I also discovered that Google Recorder, which I use on my Pixel phone. Yes, I have an Android phone, even though I use a Mac, which is weird, but there we go. Actually, when it uploads the file, it does a transcription as well. So, I'm going to use this as a bit of a comparison and play a little bit of this and see how it comes through. Okay, it's playing. I can't actually hear it. Let me just get that working. Which is strange, because I should be able to hear it. You can hear it. I cannot. That's okay. I can fix that. I still can't hear it. I think I might just have muted my... Smells like old llamas. Yeah, that is working, because now I'm hearing it twice. Yeah, all right. It's quite quiet though. Okay, now I'm joined by Lars Klint. 
Lars and I have actually known each other for some time back in the Melbourne connection, but I've not actually seen him, like many people, not seen him in Melbourne for some time, but seen him in a few Eastern European cities over the past few years. Lars, how are you? So, you can see that's not perfect, but it's not terrible. Made a few mistakes here. We won't go any further. And let's also try with Otter, another popular tool. This is the same file. It's an M4A, so it's moderately compressed, which might also, I don't think it's going to have a massive impact in this case, but it's transcribing. This will take a little bit of time. So, let's go back to Mac Whisper in the meantime. So, that has downloaded. Yeah. Close. And this is where things are interesting. We have open file, new recording, record app audio, which is interesting because I currently do that with Loopback. We can drag files here. Nice to see it supports MP4 and M4A. So many things still don't, despite the fact MP3 has actually been declared a basically dead format by its creators. But anyway, and we can pick the models here. This is a really nice interface so far. Auto detect the language. Let's just leave it as is. Lars is, I wouldn't say half, but he is originally Danish, but has lived in Australia for a long time. So, he also has a slightly interesting accent, but I think it's more Australian than Danish. So, I don't think it's going to struggle, but that's actually another interesting experiment and test of lots of these tools is to use people who are speaking in, not in their first language, but it doesn't really count for him. But let's do the file. There we go. That's doing its thing. We can see how long this will take as well. This is now done. So, let's have a listen to... Oh, it's still going. Okay. All right. Let's see who wins. That should be interesting. I would argue, I mean, this is working offline, locally, of course. I don't know whose computer is faster, mine or Otter's. 
Hard to say. But I would say, actually, they're equally as fast, at the very least. Whilst that's processing, I will add MacWhisper as an audio source so we can get it to play. Did I just say MacWhisper? I hope I said MacWhisper. I had a funny feeling I said something else. Okay. That is done. This is now. Okay. So, let's see. It's also got keywords, which is what this one did as well. I don't know where it's put them. They're usually on the... I'm not quite sure where they are. It's definitely on the overview. An interesting thing I also discovered in this, if you use Google Recorder, I'm not sure if it's Pixel only, it now has features that are kind of like Descript, where you can actually cut and remove based on text and things like that, which is really cool. Descript is an amazing tool as well. Actually, I could test Descript as well. Because I currently have a paid-for month. But this is free, which is actually kind of cool. I haven't really... I only discovered this feature even existed a couple of days ago. So, I haven't really experimented with it very much yet. Okay. Here is Descript.

Speaker 2: Let me upload the file. Don't use this very often. So, I actually remember how to do that.

Speaker 1: Maybe I can do it the easiest way and just drag it on. I would hope and assume they're all going to be using slightly different models and ways of doing this. So, that's still not... I've always found the way that Descript organizes things to be slightly confusing. I must admit, I've never really known where I'm supposed to organize stuff. It has this odd sort of pseudo file system. It's odd because it shows me a plus, but nothing happens. Well, if it's made that complicated, then maybe it's an issue in itself. Ah, choose a file to transcribe. That's it. You have to open a document first. Now I remember. It's a little odd. Okay. So, that's doing its thing. Let's go back to Otter now. Okay. Okay. Now I'm joined by Lars Klint. Lars and I have actually known each other for some time back in the Melbourne connection, but I've not actually seen him, like many people, not seen him in Melbourne for some time, but seen him in a few Eastern European cities over the past few years. Lars, how are you? So, that also made mistakes, but it made different mistakes. Here it got Eastern European correct, but it mixed up "Lars, how are you?" It said "a lot, how are you?" So, it made different mistakes. Let's see if Descript is done. Okay. Oh, I need to, give me a second. I need to add that.
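The "different mistakes" comparison above can be made a little more concrete with word error rate (WER), a common way to score a transcript against a reference. The sketch below is not something used on the stream; it is a plain word-level edit distance in standard-library Python, and the function names are hypothetical. The example pair is the one mentioned in this segment: the spoken "Lars, how are you?" against the tool's "a lot, how are you?".

```python
import re


def tokenize(text):
    """Lowercase the text and split it into words, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = tokenize(reference), tokenize(hypothesis)
    if not ref:
        raise ValueError("reference transcript is empty")
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # reference word missing from hypothesis
                            curr[j - 1] + 1,      # extra word in hypothesis
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("Lars, how are you?", "a lot, how are you?"))
```

Scoring each tool's output against the same hand-corrected reference would give comparable numbers, although a single short clip like this one is far too small a sample to rank the tools.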

Speaker 2: Double check it's got the right microphone. Yeah, it's got transcript. What the hell?

Speaker 3: Oh my god. That's new. Chris smells like old llamas.

Speaker 4: Ow. It works. Wow, Google. Okay. Now I'm joined by Lars Klint. Lars and I have actually known each

Speaker 1: other for some time back in the Melbourne connection, but I've not actually seen him, like many people, not seen him in Melbourne for some time, but seen him in a few. Now, one thing I noticed there that you may or may not have noticed, the volumes of each application is completely different as well. That is a lot louder than the others. Obviously, you get other features with Descript, things like the ability to remove these sorts of words, but again, it made mistakes, but it made different mistakes. Well, it made less mistakes. Bright microphone instead of right microphone and Lars instead of llamas, which I assume the others got right. Yeah. Llamas. Llamas. Lars likes llamas. Okay. Enough of all that. Now let's get to Mac Whisper then. No, not Descript. Where are you? Hiding. Okay. Okay. I am

Speaker 2: not seeing anything.

Speaker 1: Did I miss a step there? Is there any kind of log? No. I'm always interested to see as well, this looks pretty nice, but some of the other AI tools I tried on the stream before still were not completely Mac native. This looks like it is, which is good. I'm not seeing any Electron shells here, which is nice. I'm even seeing Quick Look panels and things. It's using a fair chunk of CPU. That's sort of to be expected. It's doing some fairly advanced stuff, but again, I'm not seeing anything. Ah, here we go. That was just a lot of, I don't know what that is. Just a lot of silence. Okay. So, I assume it works in a similar way. Okay. That's all just the mumble murmur, I suppose. This smells like old llamas. Some sort of threshold it's missed there. Okay. Now I'm joined by Lars Klint. Lars and I have actually known each other for some time back in the Melbourne connection, but I've not actually seen him, like many people, not seen him in Melbourne for some time, but seen him in a few Eastern European cities over the past few years. Lars, how are you?

Speaker 5: I'm good. Thanks, Chris. Yeah. I think the last time we saw each other was in Ukraine.

Speaker 1: So, I mean, again, it made some mistakes. Lars Klint. I mean, that's a name. We are also only using the small model, of course. It missed out some of that earlier stuff that could have just been a volume issue maybe, but the rest of it is pretty good. Didn't make any other mistakes here. So, let's have a look at what else we can do, because there's some interesting features in the tool. We do have the ability to edit, which is cool. I don't know what it's going to edit though. I mean, it's not editing the audio, I don't think. I mean, there's nothing to edit. So, it's just editing the transcript. Yeah. And then we have the ability to copy that line, favorite that segment, or delete it. So, it's just all from the text. That's cool. I mean, we could delete all of this stuff, but we can actually multiple select as well, which is good. I mean, maybe it didn't actually do anything. That also doesn't work. Uh, okay. I don't know. That didn't work, but... Oh, no, it is working. I'm just not seeing it updating. You can see the numbers changing, obviously, slightly. I think. Yes. All right. We have a playback speed, and we have here, what do we have here? We have global find and replace. We have removing the timestamps. I'm assuming this is font size, just big and small. This is nice to have tool tips. That's the, okay, reader format. I guess this will take us... Just changing the alignment. We have, I'm guessing that's copy. It's a clipboard. I don't know. Maybe. I don't seem to have got anything in my clipboard. We have find and replace. Again, we have an export. Now, this is interesting because I saw some of the features are text, various different files for Pro, but we also have SRT. Now, this is the standard format you use for captioning, and you often upload them to YouTube and things like that. VTT is another of those. So, that's useful too. And we have some kind of... Oh, no, speakers. 
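As a rough illustration of the SRT export mentioned above: an SRT file is a sequence of numbered cues, each with a `start --> end` timestamp line in `HH:MM:SS,mmm` form followed by the caption text. The sketch below shows how timed segments could be written out in that shape. It is not MacWhisper's code; the function names are hypothetical and the timings in the example are invented.

```python
def format_timestamp(seconds):
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def to_srt(segments):
    """Turn (start_seconds, end_seconds, text) tuples into SRT text."""
    cues = []
    for index, (start, end, text) in enumerate(segments, start=1):
        cues.append(
            f"{index}\n"
            f"{format_timestamp(start)} --> {format_timestamp(end)}\n"
            f"{text.strip()}"
        )
    # Cues are separated by a blank line.
    return "\n\n".join(cues) + "\n"


if __name__ == "__main__":
    # The timings here are made up for illustration.
    print(to_srt([
        (0.0, 2.5, "Okay, now I'm joined by Lars Klint."),
        (2.5, 6.0, "Lars, how are you?"),
    ]))
```

VTT uses a very similar layout, with a `WEBVTT` header line and a dot instead of a comma before the milliseconds.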
And if we had the Pro version, I think it would do this automatically, maybe. And it's pretty much it. Relatively straightforward in many respects. I'm interested to see... You can also do it straight from YouTube, but I think other features. Let's try one live. Okay. Oh. Oh, there we go. I'm not sure what microphone it's using, though. I don't see a way to set a microphone. So, I'm not sure. Future versions will show transcript live. Let's hope it's using the right microphone. Let's see. If not, music fades. Thank you. Thank you. Not quite sure if that's what I said. I don't know where that's coming from. Let's make sure that this is the... Now, hopefully, that's not going to break OBS. Let's try that again. What's it transcribing? Hang on. This is transcribing. I'm not talking. Something seems to be a bit odd here. Let's try that again. Okay. Hello. I am Chris. On the learning live stream for Wednesday, the something or other of March. Playing with Mac Whisper. Something's not quite right there. I'm not sure what input it's choosing, and I can't change it. So, that's something to add as a feature request. Change the input. Because I don't know what input it's using. Okay. Record app audio. Interesting. Now, this may need to be... Yeah. Quit and reopen. It's fine. Okay. It did do that. But it did do that. But it does say beta. Here we go. Oh, wow. So, this is something I currently use Loopback for. I'm doing it at the moment. And Audio Hijack. That's kind of cool. Let's just for out of interest, let's do OBS. Maybe we should do... I'm not really sure what's going to happen here. Hello. This is Chris. Playing with Mac Whisper on the learning live stream.

Speaker 3: Let's see.

Speaker 2: I'm not quite sure...

Speaker 1: Yeah. Something isn't quite working here. I like the fact you can see... Yeah. I think it needs to take into account people with more than one microphone. And I have many. But anyway, let's try actually out of pure interest... Let's try... I'll try some bad...

Speaker 6: Okay.

Speaker 1: Hallo. Ich heiße Chris und ich wohne in Berlin in Deutschland. Try some very basic German. Let's get the format. All right. I don't know what format that is. I'm not quite sure what format it just saved in there. Export as audio. Okay. Thanks, QuickTime. Very helpful. It's M4A. Good. Let's see. It said it should... It'd be interesting if we could see which language it thinks it is. Because, I mean, it's not terrible. Although, hang on, I haven't... Ah, I didn't. I didn't download German, did I? That's a good point. Let's do that actually. Multiple languages, small. So, maybe it was just transcribing me speaking very strange English. Let's see. Okay. And I guess we can delete that and do it again. Let's see what happens. It's kind of the same. What I would like is if there's a way to say, detect the language.

Speaker 2: Maybe we should try in one last time.

Speaker 1: I mean, it could be my bad pronunciation. No. Okay. Yeah. Could be my bad pronunciation. It got the key bits. Although, these are also kind of used in English. Yeah. Anyway. Not terrible, but not great either. But that's fine. I'm not actually going to use it for German anyway. So, it's fine. All right. Well, let's get back to the webpage. So, MacWhisper. Available, bizarrely, on goodsnooze.gumroad.com. It's kind of hard to find it in other ways. This interests me. It also works on Intel, which would imply to me maybe it's not using the optimized versions of things. But we'll see. For free, you get quite a lot of features. No data transmission, which is also quite cool. Because all those other tools I looked at, Google, Descript, Otter. A, it's a monthly subscription. And B, it's going on someone else's server. We looked at a lot of the free options and some of the paid ones. You also get batch. Manually add speakers. I didn't even see automatic recognition of speakers. But okay. Transcribe system audio, which is also quite useful. Sorry, I don't know. That keeps doing that. Another interesting thing, you could do that in theory, maybe. You could hook up a kind of live transcription screen in OBS or something, one-time payment. And it's only $17, I think. And I will maybe contact them about this. But, yeah. Combined segments. Save files to come back to them later. That's interesting. Does that mean? Let's see. What does that mean? Kind of implies. Maybe it just means ones you're editing. Yeah, because I'm still seeing it there. Ah, I see. It has to retranscribe each time. That's interesting. Okay. So, if you've got a very long file you want to go through, you can't come back to it later. That is a slight pain. So, you have to kind of work through it all at once. Or just keep it open, I suppose, is the other option there. Yeah. Okay. And plenty more. All right, then. Well, pretty impressive, really. For 17, double check that. 17 euros for one Pro license. 
That's pretty good. I will compare that to, well, the Google one is free, as far as I know. I don't know how long it's free for. Otter costs, I think, 50 or 60 a year. And Descript is quite expensive. But Descript does a lot of different things, actually. So, not completely fair comparison. But pretty impressive, actually. I think this is something I will probably continue to use, because I do do transcriptions. I don't do them enough for my podcast. I do them mostly for videos now. And Adobe Premiere also has transcription built in, which is actually not massively accurate. But the convenience of having it in the same editing environment is quite useful. But I may start doing it for my podcast. I actually need to catch up massively with all the transcriptions I have, anyway. So, for 17 euros, pretty good. And I will check out some of the other applications that the developer has made. If you enjoyed this learning livestream, and I actually did enjoy it quite a bit, you can find more about me at ChristianSchiller.com and subscribe, leave a comment, say hi. If you're watching this later, have you used Mac Whisper? Have you used something similar? Maybe another Whisper wrapper? Tell me your experiences with any of those. And, of course, I will maybe check back in later when I've got the Pro version and try with a larger model and see if it's any better. But I enjoyed that. All right. I will be back on Friday with my more creative videos, doing something a little different this week. I will be going through a bunch of old music I found from GarageBand and Cubase, and converting them to Ableton in the time I have available. So, something a little bit different. I found it quite a useful exercise, and I'm going to go through some of what I learned from doing it myself as well. So, until then, thank you very much for joining me. And I forgot to do... Actually, now I can do this. 
I've got my Farrago soundboard hooked up so we can say, you know, I'm glad you enjoyed the show. And that didn't work. Oh, man, that was disappointing. Why did that not work? That was disappointing. It was working the other day. Ah, I see. Huh. I think I know what's happened. I think my other audio share has overridden it. Well, fine. Fine, Farrago. Fine. I was going to have, like, a nice outro bit of music that I live triggered, but not today. So, anyway, all that aside, I've been Christian.
