Top AI Dictation Tools MacWhisper, Wispr Flow, SuperWhisper, and a Local Alternative with Aiko
Added on 01/29/2025

Speaker 1: Hey guys, my name is Robert. Recently, I have noticed that dictation tools are becoming more and more popular with the rise of AI applications. People are realizing that talking to your computer instead of typing can save a lot of time. In this video, I want to explain to you what this is all about, especially with AI involved. It's a big topic, but here I will just give you an overview of some applications that I have tried, their pros and cons. And I also want to show you how to do this for free, locally and without the internet. Now, dictation itself isn't something new. Computers and phones have had this feature for a long time, and there's always been specialized software for this as well. But with Whisper AI, a model trained on multiple languages with amazing accuracy, the possibilities have expanded a lot. While Whisper and similar models are not perfect, when combined with LLMs, the time-saving potential and everything that you can do is kind of mind-blowing. Let me break this down in a simpler way. There are two main parts to this new AI dictation that we're starting to see around. First, there's transcription. That's where the speech-to-text AI model turns your voice into text. The second part involves running that text through a large language model, or LLM for short. This is where a lot of the cool stuff happens. LLMs, like ChatGPT, can improve grammar, format text, translate, summarize, restructure. They can answer questions, help with coding, explain complex topics, and so much more. It really is like having a super smart assistant with you. By combining highly accurate transcription with these powerful language models, we're opening up a whole new world of possibilities. It's an impressive integration of voice and text technologies. There's a lot more to this, especially since new speech-to-speech models are starting to appear, but that's a topic for another time. Okay, let's look at how this works in practice.
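The two-stage flow just described (speech-to-text first, then an LLM pass over the transcript) can be sketched as a tiny pipeline. This is only an illustration of the data flow, not any specific app's code: the `transcribe` and `refine` callables are hypothetical stand-ins for whatever engine you plug in, such as a local Whisper model and a chat model.

```python
# Minimal sketch of the two-stage AI dictation pipeline described above.
# Any speech-to-text engine and any LLM can be plugged in; the callables
# below are hypothetical stand-ins, not a specific app's API.

from typing import Callable

def dictation_pipeline(
    audio: bytes,
    transcribe: Callable[[bytes], str],   # stage 1: speech-to-text (e.g. Whisper)
    refine: Callable[[str, str], str],    # stage 2: LLM post-processing
    prompt: str = "Fix grammar and punctuation. Return only the corrected text.",
) -> str:
    raw_text = transcribe(audio)          # voice -> raw transcript
    return refine(raw_text, prompt)       # transcript -> cleaned/translated/etc.

# Dummy stand-ins, just to show the data flow end to end:
fake_transcribe = lambda audio: "hello world no punctuation"
fake_refine = lambda text, prompt: text.capitalize() + "."
print(dictation_pipeline(b"...", fake_transcribe, fake_refine))
# -> Hello world no punctuation.
```

The point of keeping the two stages separate is exactly what the apps below do in different ways: the same raw transcript can be routed to a cleanup prompt, a translation prompt, or an assistant prompt without re-recording anything.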
I'll cover some popular apps that offer dictation that can be run through language models. I'll try to cover user experience, features, limitations, pricing, and some thoughts on privacy and workflow integration. First up is MacWhisper. This app has been around since the early days when Whisper AI models started to appear online, and it is great for audio transcription. It can process audio files, create subtitles from YouTube links, and has many advanced settings that I haven't found anywhere else. Notice here that I am talking about transcription, which is taking an audio file and transforming it into text. Only recently has MacWhisper added dictation, which can also be processed with large language models for more accurate grammar, translation, and other tasks. Okay, so I'm here in MacWhisper, and as you can see, we have many different options, and dictation is only one of them. One of the cool things about MacWhisper is that it allows you to use local models for transcription. I even think this may have been the main function of the app, to allow you to interact with these models, and as you can see, there are so many that it actually gets a little bit confusing. You have to click this button to see the differences. The accuracy and speed, of course, depend on factors like model size, model training, whether you are using cloud models, and your system specifications. For this example, I will just be using OpenAI Cloud, and it's really cool that they also allow you to use this. Let me go inside the dictation area. Here, well, at the top we have a video showing a little bit how this works. Then we can set up our own custom keyboard shortcut to start dictating, and here we have an area where you can select a prompt. I will start with nothing, but the AI service that I will choose will be GPT-4o mini. This one is not the smartest model, but it's pretty fast. You can choose between many different services here.
You can even choose local models, and that's a very nice recent addition. Let me go back into dictation, and down here, you can make your own prompts if you want the transcription to be processed like I was explaining to you a moment ago. I will start with nothing, and here, this one, for example, is cleanup. This one will just fix grammar mistakes, spelling, stuff like that. This one will translate, and this one is just an AI assistant. I will show you how that works in one moment. This one only says this. Now, let me just start dictating so you can see how it looks in action. And I am dictating, I am speaking to my microphone, and this should be transcribed. And I press the shortcut again, and there it is. And actually, even though I didn't run my text through any AI prompt, it still looks okay. The point of running our transcribed text through an AI prompt is that, many times, there are some limitations in how the model works, and this applies to all of the applications that use Whisper, by the way. It will not insert line breaks between paragraphs, for example, and sometimes it will make mistakes with punctuation marks. So even if it's simple stuff, you can ask it, you know what, can you clean up my text? Can you just make sure that all the punctuation is correct? And of course, you can start to do more things. Here in MacWhisper is where things start to get a little bit frustrating for me. Let me start dictating, and I will change to an AI prompt. So here I am, and now I have to go here. I have to click, select translate to Spanish, for example, and now I will be talking to my microphone, and I will stop the recording. And as you can see, it didn't translate to Spanish. This is a bug, obviously, because I have to click, then I have to stop recording, and then I have to restart again. And as you can see also, it didn't insert any punctuation marks anymore. So there's something wrong there.
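A cleanup or translation prompt like the ones described above is really just a system prompt wrapped around the raw transcript before it goes to the LLM. Here is a sketch of what such a request payload might look like, using the OpenAI-style chat message format; the prompt texts are my own illustrations, not MacWhisper's actual prompts.

```python
# Hypothetical example of wrapping a raw transcript in a post-processing
# prompt, using the OpenAI-style chat message format. The prompt texts
# are illustrative, not taken from any of the apps reviewed here.

PROMPTS = {
    "cleanup": "Fix grammar, spelling and punctuation. Insert paragraph "
               "breaks where appropriate. Return only the corrected text.",
    "translate_es": "Translate the following text to Spanish. "
                    "Return only the translation.",
}

def build_messages(transcript: str, mode: str) -> list[dict]:
    """Build a chat-completion message list for the chosen post-processing mode."""
    return [
        {"role": "system", "content": PROMPTS[mode]},
        {"role": "user", "content": transcript},
    ]

msgs = build_messages("i am dictating this should be transcribed", "cleanup")
```

Switching from cleanup to translation is then just a matter of swapping the system prompt, which is essentially what choosing a different AI prompt in these apps does.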
The first time it worked okay, the second one so-so. And now I am talking, and the text that I am speaking out should come out in Spanish. And there we go. Now it worked, but only after I stopped recording and then started again. So it becomes a little bit annoying, to be honest. I think this can be fixed. Also, the text is inserted character by character, which I think is not ideal, because if you dictate a lot, it will just keep typing bit by bit. And let me do another test, because the cool thing about using AI prompts is that not only can you make sure that the punctuation is correct and run commands like translations and stuff like that, but you can actually start using it like an AI assistant. And that's what I set up in the settings that I was showing you. Let me go here. Now let me change to AI assistant. Now close, and I will restart again just to make sure that the prompt is activated. Can you please tell me in one paragraph how these AI models that take audio and transcribe it to text work? Okay. See, it takes a little bit, typing everything out, when maybe it could be better if it just pasted all the text after receiving it all at once.

I feel like this character-by-character typing can also cause problems: if you accidentally jump to another window, it tries to continue typing there. For simple things like transcription without any AI prompt, or maybe for simple prompts, it's okay. When you start trying to use different prompts, dictating and then having to go and click here, it just doesn't feel very effective. There's no AppleScript or URL scheme to directly trigger different prompts. And there's no way to find your dictated file after you stop recording, which is kind of scary if something goes wrong. The developer has mentioned that they plan to add a history feature soon, but we do not have that yet. Of course, it's really cool that you can do this with online models and with local models. Even for the AI part, you do not need the internet for anything. So that's really, really cool. So as negatives, in short: the interface and user experience feel a bit rough, there are limited ways to control the app, there's no history feature yet, and some advanced dictation options are missing. While the dictation feature works okay, as an advanced user, it feels a bit like an afterthought. You also need your own API tokens to have access to the best cloud models. This is actually not a real negative, just something to be aware of. I still appreciate the option to have these cloud models, to be honest. On the positive side, it is a one-time payment of about $40, and you get amazing transcription features. For file transcription and creating subtitles, it really is at the top of the game. I haven't seen any other Whisper app offer this amount of advanced settings, which can be super helpful if you are dealing with difficult or noisy audio files. I can only expect the dictation and user experience will improve over time.
Another positive is that the developer is very responsive and updates the app frequently. As I told you, MacWhisper allows local processing, and this is also great for privacy or if you are in a place without internet access. Despite some limitations, I think it's totally worth the investment, especially if you do video work and need subtitles or transcriptions. It's also great if you are looking for basic but accurate dictation without lots of bells and whistles. Okay, the next application that I want to cover here is Wispr Flow. Wispr Flow has been very popular recently, and to be honest, it's a little bit of a tough one to talk about. I will try to be very objective here. Right away, you open the app and it just feels so natural, so nice, very minimal. You can quickly tell that this app was built with voice at its core, and it's not just one extra feature, like it feels in MacWhisper. Here we just have Home, Dictionary, History, Settings. We have something like a dictionary in MacWhisper too, but to find it, we have to go inside Settings to Find and Replace, which basically does the same thing: you can list some words there that you want to always be transcribed correctly. Well, as you can see, the difference in settings makes sense, because in MacWhisper you do have so many other things to deal with, but many of them don't actually have to do with dictation. Here, the actual settings that you have for dictation are minimal. You do not have to choose between different models. You do not have to set different AI prompts. Actually, even though you can customize your hotkeys and all that, you don't even have to mess with all of those settings. You can right away go to a note or an input field and just, I am pressing down the Fn key and holding it, and I'm talking to my microphone, and I want some text to appear in my front app the moment that I release my Fn key.
And the moment that I release it, there it is. And it has the correct commas, the correct periods. It just feels very intuitive. And if I don't want to hold the key down, I can also just lock it, with Fn and the spacebar. So I'm not using my hands to press any keys, and it's recording in a very similar way as it was with MacWhisper. If I want to stop, I just have to press the keyboard shortcut again. And there we go. My text appears in my front app, and it's not typing character by character like we were seeing, and it still feels very, very fast. Actually, it feels much faster than MacWhisper, even with the cloud models. Now, even though the user doesn't set different AI prompts or choose whether to use AI at all, my guess is that there's still some AI processing going on on top of the transcribed text, because depending on where you are on your system, or depending on the content that you dictate, it will be formatted very cleanly and very nicely. And actually, there's something else called flow commands. In MacWhisper, for example, if I want to change to a different AI prompt, I have to go and click there, and then I can use it as an AI assistant. But here, it seems to me that by using a word or a phrase, the application will recognize this and automatically switch for you. So I can say, flow, explain to me in a couple of sentences how AI dictation works. And there we go. I didn't have to go and change anything; by using my voice alone, it kind of switched mode to this AI assistant. All of these things that I have been showing you, dictating or even using this as an AI assistant, are things you can do with MacWhisper, even though it's a little bit slower and feels a little bit clunkier, or you need to do more of a manual process. But here you start to have some more advanced features, because you can do stuff on selected text, for example.
I select text and I can order something. Flow, expand this text talking about the same topic. And there it is. So it kind of understands context. Not only that, but I can also, for example, ask it to open websites. Ask Perplexity, what's the capital of Mexico? And there we go. It sends me to the Perplexity website, and it is just so cool. I mean, it just feels really, really nice. Let me close this and go back to my note. So the way that different actions or different modes are triggered by voice is kind of mind-blowing, kind of magical. Now, this starts to become a little bit concerning, because you don't know how much this application is reading from your screen, or listening to you, or grabbing whatever you type, and stuff like that. For example, I don't even have to have text selected, and I can ask about whatever I have on screen above me. Flow, in just two sentences, describe to me what it says at the top, because it's still very difficult for me to understand. There we go. I didn't have to select anything, and it understood that I'm talking about AI dictation. It simplified this text. This context awareness, even though it looks cool, also makes me a bit concerned. The app only works online, which means that it is sending data somewhere. The website does mention that they care about your privacy, but the extent of the data capture and how it is used is not clear. The development team seems to keep all of this information secret, which, to be honest, is kind of worrying. Their marketing efforts are impressive, for sure, but their behavior on some platforms like Reddit, where they keep promoting the app and then deleting their posts or using throwaway accounts whenever there's some criticism, has definitely been questionable. And I'm not making this up; I have seen it firsthand. The positives of the app include a very user-friendly interface, accuracy, speed, and both transcription and AI assistant capabilities in a very intuitive package.
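The trigger-word behavior described above, where saying "flow" switches from plain dictation to an assistant command, can be approximated with a simple prefix check on the finished transcript. This is only a guess at the general mechanism, not Wispr Flow's actual implementation:

```python
# Sketch of wake-word routing: if the transcript starts with a trigger
# word, treat the rest as an AI command instead of plain dictation.
# This is a guess at the general idea, not Wispr Flow's actual code.

TRIGGER = "flow"

def route(transcript: str) -> tuple[str, str]:
    """Return ('command', text) for assistant requests, ('dictation', text) otherwise."""
    words = transcript.strip().split(maxsplit=1)
    if words and words[0].rstrip(",.").lower() == TRIGGER:
        rest = words[1] if len(words) > 1 else ""
        return ("command", rest)
    return ("dictation", transcript.strip())

print(route("Flow, explain how AI dictation works."))
# -> ('command', 'explain how AI dictation works.')
print(route("This is just normal dictated text."))
# -> ('dictation', 'This is just normal dictated text.')
```

A real implementation would need to be more careful (for example, not triggering when a sentence legitimately starts with the word "flow"), which is presumably part of why this feels magical when it works.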
Another good thing is that it's developed by a team, potentially meaning faster development and better support. On the negative side, well, for me, it's mainly the lack of transparency. It's unclear what's being captured on your system, and the app adds itself as a login item each time that you open it. It constantly sends data over the internet and uses significant system resources, even when not in use. There's that, and for more advanced users who want more control, the lack of customization options and settings is also very limiting. Another negative that I see is that Wispr Flow is a subscription-only app. I mean, I told you that this is a tough one. Even though the app is a very powerful tool and overall it feels great to use, the privacy concerns and lack of transparency from the team make me quite uncomfortable. I hope not to get many dislikes for this, but I will be uninstalling it after this review. Whether it's worth the subscription price depends on your needs and your comfort level with these issues. Now I want to briefly talk about SuperWhisper. SuperWhisper is an application that is packed with features and flexibility in a way that, in my opinion, goes beyond the last two. So it's a little bit difficult for me to keep this brief. And let me start not with the settings, but by showing you how SuperWhisper can do what we did with the previous two apps. I will be pressing the keyboard shortcut to start dictating and holding it, then I will release. Here I am talking to the microphone and I am dictating some text. And the moment that I release the keyboard shortcut, my text will appear in my front application. And there we go, it appears super fast. Depending on the model, and we will look at that in one second, it can be just as fast as Wispr Flow. Very fast and very natural if you want to hold the keys down, or you can also just lock the recording.
And in this way, I'm not even pressing the keyboard shortcut, and it's still capturing my voice. And the moment that I press the keyboard shortcut again, it will just paste very fast. And again, it just goes to my front application. Okay, now, what I was showing you before in Wispr Flow of using it as an AI assistant, we can also do something very similar here. I need you to tell me in a couple of sentences how this whole AI dictation thing works, because I'm having trouble understanding it. I ask this and it answers my question. It's no longer dictating. And I can select this, and this is something that we didn't see in MacWhisper, but as I was telling you, it's an advanced feature in Wispr Flow. I can select it and just say, please expand on this topic. I still need a few more sentences. And there we go. I didn't have to say anything else. It was able to understand my selected text and expand on it. Not only that, I can also go a little bit below and ask again. Now, what I need from you is some keywords for me to understand what this is all about. Just some keywords for the topic that I have been discussing here in this document. And again, there we go. Without selecting text, it understood my request. It understood the rest of the text in my application. That is just to show you that this app can behave in a similar way with regard to the advanced features of Wispr Flow. It doesn't have the voice commands to open websites and all that. Maybe they will be incorporated at some point. It's not a deal breaker for me, because what you get on the other hand is an incredible amount of flexibility that you do not have anywhere else. The settings window is not as simple as Wispr Flow's, but unlike MacWhisper, you can see that all of these configuration options still feel very focused on voice dictation and how the app behaves in relation to it.
If you feel like all of this is too much, you can hide some of the advanced features, but I think that you are missing out if you do that. It is still very easy to navigate. And down here, you can set it up so that the app behaves not only like what I showed you, but with a recording window that we will look at in a moment. And I will also disable paste. SuperWhisper has something called modes. Let me show you. With the different modes, you can set different system prompts with different models, and you can decide if you want a mode to use copied text, or if you want it to use application context. The application context is grabbed through your system's accessibility API. The developer of this application, unlike the Wispr Flow team, is super, super transparent. Actually, every time that you do some dictation, you have access to all the text that the model received. Here, for example, I can see "user is currently using Bear, focus on this element," and then it shows all my text from my input field. If I have text on my clipboard, and if I chose to use that in the specific mode, it will be able to grab it there. So having access to this alone allowed me to understand so much more about how the application works. And the one thing that this doesn't have, as I was telling you, is the different voice commands for different actions. So I cannot switch between different modes by voice alone. In the example that I showed you at the beginning, I was using a different keyboard shortcut to activate a different mode. And for me, that still feels very intuitive, specifically because I have many different modes set up for myself. It's not just one mode for dictating and one mode for AI tasks. No, here I have so many different modes. And something that I really appreciate about SuperWhisper is that it gives me so many more automation opportunities.
That's why I was even able to create this Alfred workflow. So I can very simply select a mode and start dictating in it, or just switch between modes. I can also jump to the settings or the history, and I can toggle recording. All of this I cannot do with MacWhisper or with Wispr Flow, because they don't have these kinds of deep links, URL schemes, or AppleScript support. Here I can even show my current mode in my menu bar, and this is something that I did with Keyboard Maestro. So I always know where I am, I always know what I have to press. I have different keyboard shortcuts to jump between different modes. Very intuitive for me as well. Maybe not as simple as using voice, but for me, using voice alone in this kind of application feels limiting. So going from this to Wispr Flow, I just feel like they put me in kind of a cage, you know? Here we have different AI models. We have the possibility of using cloud models with your own API token, if you want that, or you can use some models locally that you can download. As for the language models included in the application, you can use any of these without your own API tokens. This is super, super nice, because as these are popular models, you already kind of know what each model is good for. So these, combined with modes, allow me to, for example, have one coding mode to use with Sonnet 3.5, a dictating mode to use with GPT-4o mini, or a mode to come up with ideas using Haiku. All of that because of the control that the user has, and because of the power that the developer has given the user access to with all of these settings and all of this configuration. My favorite apps are those that allow you to expand your interaction with them with your own needs, ideas, and customization. And because of that, I love this. There's so much to say about SuperWhisper. It really is one of my most used and favorite apps right now.
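The modes described above, where each mode bundles a model, a system prompt, and context options, map naturally onto a small registry. Here is a sketch under that assumption; the mode definitions, model names, and field names are my own illustrative examples, not SuperWhisper's actual configuration format.

```python
# Sketch of SuperWhisper-style "modes": each mode bundles a model, a
# system prompt and context options. The contents here are illustrative
# examples, not SuperWhisper's actual configuration.

from dataclasses import dataclass

@dataclass
class Mode:
    name: str
    model: str
    system_prompt: str
    use_clipboard: bool = False     # include clipboard text as context
    use_app_context: bool = False   # include focused-app text (accessibility API)

MODES = {
    "dictate": Mode("dictate", "gpt-4o-mini",
                    "Clean up grammar and punctuation only."),
    "code":    Mode("code", "claude-3.5-sonnet",
                    "You are a coding assistant.", use_clipboard=True),
}

def context_for(mode: Mode, clipboard: str, app_text: str) -> str:
    """Assemble only the extra context this mode is allowed to see."""
    parts = []
    if mode.use_clipboard and clipboard:
        parts.append(f"Clipboard:\n{clipboard}")
    if mode.use_app_context and app_text:
        parts.append(f"Focused app text:\n{app_text}")
    return "\n\n".join(parts)
```

Making the context opt-in per mode is what allows the kind of transparency described above: you can see exactly which pieces of text a given mode was permitted to send to the model.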
So as positives: SuperWhisper, in my opinion, has the perfect amount of features to be used either for dictation or as an AI assistant. Like I said, you have total control over customizing different modes. I use it for writing ideas, structuring outlines, summarizing, coding, dictating, and helping with grammar or structuring video scripts like the one that you're watching right now. SuperWhisper also has context awareness features similar to Wispr Flow's. This feature is optional and you can set it up per mode, but it's amazing for interacting with both typed text and voice. As I was telling you, with SuperWhisper, not only do you have the option of pasting the result directly into your front window like the previous two applications do, but you also have the option of the separate recording window that we saw in the settings. I actually use that a lot more, because I can go to any application, or be on any website, select some text, and say, please simplify this for me because it's too complicated for me to understand. And in this kind of situation, of course, I don't need the text pasted into my front window. And in case I do need that, I can just go to a note and paste, because I know that I have it on my clipboard. SuperWhisper offers a subscription, but aside from that, there's also a lifetime payment option, currently priced at $250. It took me like five minutes of testing the app to decide to buy it, not only because of all the features, but also because I don't know where else I will find unlimited access to models like Sonnet 3.5 or GPT-4o at this kind of price. I know that new models are coming out every other day, but what is included here is really, really good, and the developer keeps the app up to date. The negatives: well, the lifetime option isn't exactly cheap. I think it's absolutely worth it, but it's a bit of an investment upfront. You can also go with a subscription if you prefer.
If you're not into tinkering with settings, you might also find it a little bit overwhelming at first. There's definitely a learning curve for the more advanced stuff. While one of SuperWhisper's biggest strengths is all of the settings and control that the user can have, this can also lead to confusion and misunderstandings about how the app works. My biggest concern with SuperWhisper is that currently it seems that a single developer is handling coding, marketing, business, and customer service. While most effort goes into app development and fixing serious bugs, I just don't know how much bandwidth there is for growth. So far, I can see that important bugs and issues from users do get attention. It just worries me a bit, because I've seen app developers become totally unreachable after getting more popular, and I think that this app has the potential to become pretty big. So unless a few more team members join, I don't know how this will go in the future. It's still impressive that the developer has done everything by himself, and I'm grateful to have this app. So I'm very, very happy to recommend it. I told you that I wanted to give you a way to do this for free, and there's this application on the App Store called Aiko. Aiko also uses Whisper for transcription. This application was not originally made for dictation, but the really, really cool thing about this app is, well, number one, it's free. Number two, it uses Whisper models, which are very accurate, as I was showing you and telling you before. And it has integration with Shortcuts. So I created a small Shortcut that I will be sharing with you. Basically, with Aiko, and let me go to the Shortcuts app and open the Shortcut that I made, I use another application called LM Studio. With LM Studio, you can download open-source AI models. There's, for example, Llama 3.2, which is pretty good.
And you can set up a local server, and this works totally offline. I will start it here, and Start. And here with the Shortcut, you do not even have to change anything. This Shortcut will work offline using Aiko and LM Studio. If you have something like Ollama, you probably have to go down here and change the URL for that. But it also works with cloud models if you want. So if you enter Cloud here, and if you enter your token here in this space, then it will use OpenAI Whisper, and it will use the model that you set here. So this Shortcut can work both locally and online. And again, like we saw before, everything will depend on the system prompt. So here we have this field where you can say, you are a translator. Hello, I am just testing this, and I'm making sure that it works properly. And again, Aiko takes that, sends it to LM Studio, and I get the result. I set it up like this, as a pop-up, because this Shortcut is so basic, and it's a great opportunity for you to also understand how this works. If you want behavior like the one that we were getting in the other apps, you can actually just put some script there so that it presses Command+V. One thing I have to mention is that the Shortcuts app has a time limitation when making online requests. If a request takes too long, the app will quit. This is a problem because if your dictation is too long, and LM Studio takes a while, or if OpenAI's server is slow to transcribe everything, then it will time out and you will end up with nothing. But don't worry. If that happens, know that the latest audio file is always in your Shortcuts folder. That's a good backup: you can just go there, select the file, right-click, and share it to the Shortcut to try transcribing again. That's there as a safety net so you don't lose your last dictation. If you plan to use this for longer recordings, I actually created a copy of the Shortcut and connected it to Keyboard Maestro as a macro. I will also be giving you that one in the link in the description of the video. With that one, you don't have to worry about the limitation of the Shortcuts app, because the actual processing with a local model will be done by Keyboard Maestro, and that's much more flexible. Again, that one will also appear as a pop-up, and you can decide to paste it or do something else with it. For even longer files, you probably have to do some tests, and you probably will want to do everything with Keyboard Maestro and skip the Shortcuts app totally. But that's beyond the simple alternative that I wanted to share here.
As I'm wrapping up this video, I just want to tell you: if you want the absolute best that I have tried so far, there's SuperWhisper. Also, a few months ago, I made an Alfred workflow called Kiki, and the reason that I'm so familiar with this topic is because Kiki incorporates functionality similar to SuperWhisper's. You can do a lot of the same things. The problem is, well, it's not super easy to set up. There's long documentation, and maybe in the future I will make some more tutorials for it. But unless you know how to deal with AppleScript and triggering Alfred workflows externally with arguments and all that, you probably don't want to mess with it. It's also super powerful, and you use it with your own API tokens. But I just found this new app, and, well, SuperWhisper gives me very good models for a one-time payment, so I am mostly using that for now. If you don't want advanced features, there's MacWhisper, which is a bit rough around the edges when it comes to dictation, but still very good for its price, especially if you also need file transcriptions and subtitle creation. As I said, I can only expect this app to get better with time. But if you need dictation alone, maybe you'll be totally okay with something like the shortcut that I shared with you. Keep in mind that it will be faster with cloud models than with local models, and a lot of the quality of the output depends on the system prompt that you set for it. That's something to be aware of. Then there's Wispr Flow, which I mentioned earlier. Honestly, aside from the negatives, I think it's a very good application. If they decide to change how they deal with transparency and privacy, and if they can incorporate local models, or maybe make it a native app so it's faster and doesn't take up so many resources, then I think it can become a very good option if you don't mind a subscription. It's just not personally worth it for me.
But if they fix all the other issues with privacy and the shady behavior, who knows? It really feels very good to use. I'll leave that one up to you to decide. There are many other AI applications that also incorporate Whisper transcription. So if you're a developer who made one of these apps and you want to promote it, feel free to leave a comment on the video with the name of your app, how it stands against what I have mentioned, and a link to it. I will not remove it. I think it's always good to have options. Thank you so much for watching, guys. I will be putting an article version of this video, with links to the shortcut and the Keyboard Maestro macro, in the description if you want to grab those. Remember that you can sign up for my newsletter for more frequent updates on what I'm up to, and I'll talk to you soon.
