Exploring Whisper: OpenAI's Multilingual Speech Model
Whisper by OpenAI converts speech to text across languages using a Transformer model with language detection. Learn how it processes audio.
File
How OpenAI's Whisper model works (shorts)
Added on 01/29/2025

Speaker 1: What is Whisper? Whisper is a speech-to-text model released by the team at OpenAI. It works on multiple languages, which is very cool. It uses attention in a typical Transformer-esque fashion to take what they call a log-mel spectrogram, which is a representation of how the frequencies in the audio are changing over time. We then have a couple of convolution layers and positional encoding, as is very common in these Transformer setups, and then a number of encoder layers fed into a number of decoder layers. The decoder also gets some additional input tokens: a marker saying this is the start of the tokens, then one saying, look, this is English (there is some language detection in here, which is quite cool), then one saying please transcribe the following, and then the output becomes the transcription.
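The log-mel spectrogram the speaker mentions can be sketched in plain NumPy. This is a minimal, simplified version for illustration, not Whisper's exact implementation; the parameters (80 mel bins, 25 ms windows with 10 ms hops at 16 kHz, i.e. `n_fft=400`, `hop=160`) match those described for Whisper, but the filterbank construction here is a generic textbook one.

```python
import numpy as np

def log_mel_spectrogram(audio, sr=16000, n_fft=400, hop=160, n_mels=80):
    """Toy log-mel spectrogram: STFT -> mel filterbank -> log compression."""
    # Frame the signal and apply a Hann window (25 ms frames, 10 ms hops at 16 kHz)
    window = np.hanning(n_fft)
    frames = [audio[s:s + n_fft] * window
              for s in range(0, len(audio) - n_fft + 1, hop)]
    # Power spectrum of each frame
    spec = np.abs(np.fft.rfft(np.array(frames), axis=-1)) ** 2

    # Triangular mel filterbank mapping FFT bins to mel bands
    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)

    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    mel_pts = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * mel_pts / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        lo, mid, hi = bins[i], bins[i + 1], bins[i + 2]
        for j in range(lo, mid):
            fb[i, j] = (j - lo) / max(mid - lo, 1)
        for j in range(mid, hi):
            fb[i, j] = (hi - j) / max(hi - mid, 1)

    mel = spec @ fb.T                        # shape: (num_frames, n_mels)
    return np.log10(np.maximum(mel, 1e-10))  # log compression, floored

# One second of a 440 Hz tone as stand-in audio
sr = 16000
t = np.arange(sr) / sr
audio = np.sin(2 * np.pi * 440.0 * t).astype(np.float32)
mel = log_mel_spectrogram(audio, sr=sr)
print(mel.shape)  # (98, 80): ~10 ms per frame, 80 mel bins
```

In practice the `openai-whisper` package exposes its own `whisper.log_mel_spectrogram` helper, which you would use instead of rolling your own; this sketch is just to make concrete what "how frequencies in the audio are changing over time" means as the model's input.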
