WhisperJAX Accelerates Whisper with JAX and Modern TPUs (Full Transcript)

How WhisperJAX ports Whisper to JAX/XLA to transcribe hours of audio in seconds, with benchmarks, options to try it on Hugging Face and Kaggle, and limitations.

[00:00:00] Speaker 1: You can transcribe a 30-minute audio clip in just 30 seconds. Sounds unbelievable, right? But it's possible using WhisperJAX. So let's break the name down into its two parts: Whisper and JAX. Whisper is an open-source library from OpenAI that transcribes speech. If you want to do speech-to-text, one of the most popular open-source libraries with a permissive license is Whisper. So what is JAX? Google JAX is an open-source Python library developed by Google for high-performance numerical computing, machine learning, and deep learning. JAX is designed to provide an easy-to-use interface for writing numerical programs, and it is particularly well suited for executing computations on accelerators like GPUs and TPUs (TPU stands for tensor processing unit). JAX is built on top of the popular NumPy API and provides additional machine learning features such as automatic differentiation, which helps users calculate gradients for optimization problems. Because of this design, JAX is quite fast. If you compare JAX with PyTorch or any other machine learning or deep learning library, JAX is very fast, and that's one of the key reasons people use it. JAX also supports XLA, the Accelerated Linear Algebra compiler. A lot of what happens inside deep learning is matrix multiplication, linear algebra, and JAX can do all of that very quickly on accelerated computing platforms like GPUs and TPUs. Now, having said that, what this library does is combine these two worlds. While Whisper was primarily created in PyTorch to run on CPU and GPU, WhisperJAX ports the Whisper model to JAX so it can run on a Cloud TPU, and that's how you can transcribe 30 minutes of audio in just 30 seconds. And if you do not believe these benchmarks, I actually tested it out myself. I took the recent Lex Fridman podcast.
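The two JAX features mentioned here, automatic differentiation and XLA compilation, can be seen in a minimal sketch (assuming `jax` is installed; this is illustrative JAX usage, not WhisperJAX code):

```python
import jax
import jax.numpy as jnp

# A plain NumPy-style function: sum of squares.
def loss(w):
    return jnp.sum(w ** 2)

# Automatic differentiation: jax.grad returns a new function
# that computes d(loss)/dw = 2w.
grad_fn = jax.grad(loss)
grads = grad_fn(jnp.array([1.0, 3.0]))    # -> [2., 6.]

# XLA compilation: jax.jit traces the function once and compiles it
# with XLA, so repeated calls run as a fused kernel on CPU, GPU, or TPU.
fast_loss = jax.jit(loss)
value = fast_loss(jnp.array([1.0, 3.0]))  # -> 10.0
print(grads, value)
```

The same two transformations (`jax.grad`, `jax.jit`) compose, which is what lets a model like Whisper be compiled end-to-end for a TPU.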
It's with Professor Manolis Kellis, on the topic of the evolution of human civilization and superintelligent AI, and it's about 2 hours 30 minutes long. I gave it to the Hugging Face Space where WhisperJAX is hosted, asked the model to transcribe it, and it took about 31 seconds. Not a 30-minute audio in 31 seconds: I transcribed a two-and-a-half-hour audio in just 31 seconds. Unbelievable stuff. I could not believe it at all, but here we are. We have the entire transcript here. If you want to check it out, I can share this link; you can try it out yourself, and you have all the details here. For example, this is the place where Lex Fridman introduces Manolis, and you can see this is the Lex Fridman podcast and all this information. It's quite nice. It didn't spell the name Manolis Kellis perfectly, but you can see that it has done a tremendous job. In fact, the fact that I can transcribe a two-and-a-half-hour audio in just 31 seconds is completely mind-blowing, and that is possible because of WhisperJAX. You can access WhisperJAX through this Hugging Face Space, if you don't mind waiting in the queue, or the other option is to go to the repository, click "Open in Kaggle," and that will open the whole notebook in Kaggle. The only catch is that whenever I tried to run it on Kaggle, there were at least about 50 people in the queue before me, so the TPUs were extremely busy on Kaggle. Let me show you how to do that: click that link, click "Edit My Copy," and since I've already forked it, that will take you to my, or your own, copy.
The accelerator you have to select is the latest TPU virtual machine, not the older one, and once you click this button it's going to start the machine and connect to this particular notebook. Most likely you will see a huge waiting queue; right now it's 21, and earlier I was around 50. All you have to do is run everything, and then you will have the transcription ready. You can see it took about 19 seconds, and you can see the different benchmarks as well. I wanted to quickly show you the benchmarks, which are quite interesting and also mind-blowing. There are multiple versions of Whisper at this point (even whisper.cpp exists), but let's talk about four different Whisper setups and the respective backends they run on. OpenAI's own Whisper library, with the PyTorch framework running on GPU, took about 1,000 seconds to transcribe a one-hour audio clip. But when you go to Hugging Face Transformers, a lot of that time gets saved: it took about 126 seconds, roughly two minutes, for one hour of audio. When you use WhisperJAX on GPU, so this is the JAX framework rather than PyTorch, but still on GPU, it took about 75 seconds, a bit over a minute. And when you use WhisperJAX on the latest TPU, this happens in 13 seconds. In 13 seconds you can transcribe one hour of audio, and that's exactly what we saw here: in about 31 seconds we transcribed a two-and-a-half-hour audio clip, which is quite mind-blowing. I'll link the entire repository in the YouTube description so you can check everything out.
Unfortunately, I could not run this on Kaggle, like I said, because of the queue, and it does not work on Google Colab, primarily because Colab does not offer the TPU version that JAX requires. So you cannot run it on a Colab TPU; you can run it on a Colab GPU, but not on the TPU. The best way to run this is either here or by renting a TPU on a cloud service and running the code there. Running the code is quite straightforward. You can see all the example code here: you install the library, and then you have a dedicated pipeline, the Flax Whisper pipeline, and you can just transcribe. It's quite easy. If you want to save memory, you can also load the model in half precision, and there are a lot of different model sizes you can load. I hope this video was helpful in learning about WhisperJAX. Unfortunately, I could not run the code and show you live, but it's as simple as running everything here once you get the machine. If you get the machine, run everything and you're good: you have a WhisperJAX that can transcribe a 30-minute audio clip in just 30 seconds, in fact much faster, but that's what the repository's benchmarks say. If you have made it this far, I have an entire playlist dedicated only to Whisper that starts from a very basic Whisper tutorial and goes up to building use cases: how to transcribe a podcast, how to burn captions onto a video, how to do speaker diarization, how to get word-level timestamps. I've got this entire playlist on Whisper and I would strongly encourage you to check it out if your interest is speech-to-text or automatic speech recognition. I'm sure you would enjoy the playlist. Thank you so much for listening; see you in another video. Happy prompting!
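The usage described here looks roughly like the following sketch, based on the whisper-jax README (assuming the `whisper-jax` package is installed and an accelerator is available; note the class really is spelled `FlaxWhisperPipline` in the library, and `audio.mp3` is a placeholder path):

```python
import jax.numpy as jnp
from whisper_jax import FlaxWhisperPipline

# Instantiate the pipeline with a Whisper checkpoint from the Hugging Face Hub.
# dtype=jnp.bfloat16 loads the model in half precision to save memory.
pipeline = FlaxWhisperPipline("openai/whisper-large-v2", dtype=jnp.bfloat16)

# The first call JIT-compiles the forward pass with XLA (slow);
# subsequent calls reuse the compiled function and are fast.
outputs = pipeline("audio.mp3")
print(outputs["text"])

# Timestamped transcription is also supported.
outputs = pipeline("audio.mp3", return_timestamps=True)
```

The slow first call is the XLA compilation cost; the headline benchmark numbers apply to the cached, already-compiled calls.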

AI Insights
Summary
The speaker explains WhisperJAX, a port of Whisper (OpenAI's open-source ASR) to JAX (Google's framework optimized for accelerated computation with XLA on GPU/TPU). By porting Whisper to JAX and running it on modern TPUs, transcription becomes extremely fast: in the author's own test, a ~2.5-hour podcast was transcribed in ~31 seconds using a Hugging Face Space. Benchmarks are compared: Whisper PyTorch on GPU (~1,000 s per 1 h), Transformers (~126 s), WhisperJAX on GPU (~75 s), and WhisperJAX on TPU (~13 s). Options for trying it are given (Hugging Face Spaces or a Kaggle notebook), with the caveats of TPU queues/scarcity on Kaggle and the lack of support on Colab for the required TPU version. Usage examples, half-precision loading to save memory, and the author's playlist on Whisper and advanced use cases are also mentioned.
Title
WhisperJAX: Ultra-fast Transcription with JAX on TPUs
Keywords
Whisper, WhisperJAX, JAX, XLA, TPU, GPU, ASR, speech-to-text, Hugging Face Spaces, Kaggle, Transformers, PyTorch, Flax, benchmarks, Lex Fridman, transcription, accelerators, half precision
Key Takeaways
  • WhisperJAX combines Whisper and JAX to run ASR with XLA on GPU/TPU, achieving large speedups.
  • On modern TPUs, WhisperJAX can transcribe ~1 hour of audio in ~13 seconds (per the cited benchmarks).
  • Reported hands-on test: ~2.5 hours of podcast transcribed in ~31 seconds via Hugging Face Spaces.
  • Performance comparison: Whisper PyTorch GPU (~1,000 s/h) vs. Transformers (~126 s/h) vs. WhisperJAX GPU (~75 s/h) vs. WhisperJAX TPU (~13 s/h).
  • To try it: Hugging Face Spaces (with a queue) or a Kaggle notebook (long queues due to high TPU demand).
  • Google Colab does not support the TPU version required for this setup; the alternative is renting a TPU in the cloud.
  • Simple usage via the Flax Whisper pipeline; the model can be loaded in half precision to save memory.
  • Multiple models and code examples are available in the repository; the author recommends their Whisper playlist for advanced use cases.
Sentiments
Positive: An enthusiastic, amazed tone about the transcription speed; "mind-blowing" results are highlighted and viewers are encouraged to try the tool, with neutral notes on practical limitations (queues and compatibility).