Speaker 1: Have you ever dreamed of a good transcription tool that will accurately understand what you say and write it down? Not like the automatic YouTube translation tools. I mean, they are good, but far from perfect. Just try it out and turn the feature on for this video, and you'll see what I'm talking about. Well, OpenAI just released an open-source and pretty powerful AI model just for that: Whisper. It even understands stuff I can't comprehend myself, not being a native English speaker. It works for language translation too.
Speaker 1: The results and precision are incredible, but what's even cooler is how it works. Let's dive into it. But first, let me introduce this episode's sponsor, which is highly related to this research: Assembly AI. Assembly AI is the API platform for state-of-the-art AI models. From startups to Fortune 500 companies, developers and product teams around the world leverage Assembly AI to build better AI-based products and features. If you are building a meeting summarizer, podcast analyzer, or really anything related to audio or video and want to leverage AI to power transcription or insights at scale, definitely check out their API platform. More specifically, I wanted to share their summarization model, which I find really cool. As the name says, with this model you can build tools that automatically summarize your audio and video files. The model is flexible to fit your use case and can be customized to different summary types: bullets, paragraphs, headlines, or a gist. It all works through simple API calls, and you can find all the information you need about the summarization model on Assembly AI with the first link below.

When it comes to the model itself, Whisper is pretty classic. It is built on the transformer architecture, stacking encoder blocks and decoder blocks with the attention mechanism propagating information between both. It will take the audio recording, split it into 30-second chunks, and process them one by one. For each 30-second recording, it will encode the audio using the encoder section, keeping track of the position of everything said, and leverage this encoded information to find what was said using the decoder. The decoder will predict what we call tokens from all this information, which are basically each word being said. Then it will repeat this process for the next word, using all the same information as well as the previously predicted word, helping it guess the next one that makes more sense. As I said, the overall architecture is a classic encoder and decoder, similar to GPT-3 and other language models, and I covered it in multiple videos, which I invite you to check out for more architectural details.

This works because Whisper was trained on 680,000 hours of multilingual and multitask supervised data collected from the web, meaning that they trained their audio model in a similar way as GPT-3, with data available on the internet, making it a large and general audio model. It also makes the model way more robust than others. In fact, they mention that Whisper approaches human-level robustness due to being trained on such a diverse set of data, ranging from clips and TED talks to podcasts, interviews, and more, which all represent real-world-like data, with some of it transcribed using machine-learning-based models rather than humans. Using such imperfect data certainly reduces the possible precision, but I would argue it helps with robustness when used sparingly compared to pure, human-curated audio datasets with perfect transcriptions.

Having such a general model isn't very powerful in itself, as it will be beaten at most tasks by smaller, more specific models adapted to the task at hand, but it has other benefits. You can take this kind of pre-trained model and fine-tune it on your task, meaning that you retrain a part of it, or the entire thing, with your own data. This technique has been shown to produce much better models than training from scratch on your data alone.
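To make the pipeline described above concrete, here is a minimal sketch using the open-sourced whisper Python package (pip install -U openai-whisper). The high-level transcribe() call performs the 30-second chunking internally; the lower-level calls expose a single chunk's encode-and-decode step. The file name audio.mp3 and the "base" model size are placeholders for illustration.

```python
# Minimal sketch with OpenAI's open-sourced `whisper` package.
# "audio.mp3" is a placeholder file name.
import whisper

# High-level API: transcribe() splits the audio into 30-second chunks
# internally and decodes them one by one, as described above.
model = whisper.load_model("base")
result = model.transcribe("audio.mp3")
print(result["text"])

# Lower-level view of a single 30-second chunk:
audio = whisper.load_audio("audio.mp3")
audio = whisper.pad_or_trim(audio)  # pad or trim to exactly 30 seconds
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# The encoder-decoder predicts tokens one at a time, each prediction
# conditioned on the encoded audio and the previously predicted tokens.
_, probs = model.detect_language(mel)
options = whisper.DecodingOptions(fp16=False)
decoding = whisper.decode(model, mel, options)
print(f"Detected language: {max(probs, key=probs.get)}")
print(decoding.text)
```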
And what's even cooler is that OpenAI open-sourced the code and models instead of just offering an API, so you can use Whisper as a pre-trained foundation architecture to build upon and create more powerful models for yourself; a quick sketch of that fine-tuning idea follows at the end of this transcript. Some people have already released tools like the YouTube Whisperer on Hugging Face by jeffistyping, which takes a YouTube link and generates transcriptions, and which I found thanks to Yannic Kilcher. They also released a Google Colab notebook to play with right away. While some think competition is key, I'm glad OpenAI is releasing some of its work to the public. I'm convinced such collaborations are the best way to advance our field. Let me know what you think, whether you'd like to see more public releases from OpenAI, or whether you prefer the final products they build, like DALL·E. As always, you can find more information about Whisper in the paper and code linked below, and I hope you've enjoyed this video. I will see you next week with another amazing paper.
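As referenced above, here is a minimal sketch of the fine-tuning idea: taking the pre-trained Whisper checkpoint and retraining it on your own (audio, transcript) pairs. This is an assumption-laden illustration, not an official recipe; it uses the Hugging Face transformers port of Whisper rather than OpenAI's own repository, and my_pairs is a placeholder for your dataset.

```python
# Hedged fine-tuning sketch using the Hugging Face `transformers` port
# of Whisper. `my_pairs` is a placeholder dataset, not a real API.
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
model.train()

# Placeholder: fill with (16 kHz waveform array, transcript string) pairs
# from your own domain.
my_pairs = []

for waveform, transcript in my_pairs:
    # Turn the raw audio into the 30-second log-Mel features Whisper expects.
    inputs = processor(waveform, sampling_rate=16000, return_tensors="pt")
    # Tokenize the reference transcript as the decoder's training target.
    labels = processor.tokenizer(transcript, return_tensors="pt").input_ids

    # Standard supervised step: the model predicts each token from the
    # audio plus the previous tokens, and the loss compares against labels.
    loss = model(input_features=inputs.input_features, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

In practice you would batch the data, pad the labels, and mask padding tokens in the loss; the loop above only shows the core supervised objective behind "retraining a part of it, or the entire thing, with your own data."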