Building a Speech-to-Text API with Whisper
Learn to create a speech-to-text API using OpenAI's Whisper model, containerized with Docker for deployment. Ideal for transcription services.
Speaker 1: Hello everyone, welcome to the AI Anytime channel. In this video we are going to build an API service for speech to text. We'll use Whisper, the SOTA model for transcription and speech-to-text services. Whisper was created by OpenAI; they first released it as an open-source model, but it is also available through an API if you want to use it that way. In this video we're going to take the Whisper model and containerize it, so the API service will run inside a Docker container. If you don't know what Docker is: it's a container technology that helps you deliver software inside packages, so you don't have to worry about dependencies, versions, conflicts, and so on. It uses the Docker engine to run the container. And to be clear, it's the open-source model we're using here, not the API from OpenAI.

If you look at OpenAI's whisper repository on GitHub, it ships five different model weights, from tiny to large: tiny, base, small, medium, and large. We're going to rely on the base model, which runs on a CPU machine at a decent inference speed; if you have more compute power, you can also use the other weights like medium and large. I already have a few videos on Whisper for different applications, like combining Whisper with generative AI for some use cases. There are also WhisperJAX and Faster-Whisper, a couple of other libraries and wrappers built on top of Whisper that you can use if you want.

Let's get started. Our focus here is mainly on the Docker side: we'll write the instructions in a Dockerfile, set Whisper up, run it inside a container, and then use the API service through localhost. So let's start building. I'm going to create a file called Dockerfile, and inside it I'll write all of the instructions, top to bottom, that will create the Docker image; we'll then run that image inside a container, and that will power our API service. The first thing to define is the base image, which decides the runtime we start from. I'm going to use Python, and FROM is how you define it: FROM python, followed by the version. I'll rely on 3.10, the most stable version of Python for all things AI nowadays (3.11 is probably fine as well).
I'm also going to use the slim variant. There are different image flavors, like slim and slim-buster; if you need a minimal base image, those are the options to consider, otherwise the Docker image size becomes too big. So the base image is python:3.10-slim. Then I'll create a working directory with WORKDIR; let's call it python-docker.

Next, let's create a requirements.txt. In Python, all the libraries and dependencies get installed through requirements.txt with pip install -r requirements.txt (you can also do it through Poetry if you prefer). The first thing I need is fastapi, then aiofiles and python-multipart, because we'll handle a POST request with uploaded files, and then uvicorn, the web server that runs your FastAPI applications. So that's FastAPI as the backend framework, one of my favorites, plus a couple of dependencies that help you run the API. Note that nothing related to Whisper goes into requirements.txt, because we'll install it directly in the Dockerfile from GitHub.
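Putting those dependencies together, requirements.txt is just the plain list below (the video doesn't pin versions, so none are pinned here):

```
fastapi
aiofiles
python-multipart
uvicorn
```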
Back in the Dockerfile, I copy requirements.txt into the working directory created above with a COPY instruction. Then we'll use a couple of RUN apt-get commands. Docker is OS-agnostic, by the way: if it works on my system, and you have Docker installed on your machine, even if you're sitting somewhere in New York, it will definitely run.

So, RUN apt-get update, which refreshes the package index of the Linux distro inside the image, and then apt-get install git with the -y flag. Why do I need this? Because git is how we'll pull Whisper in. Next, RUN pip install with Whisper taken straight from the GitHub source: go to the Whisper repository, click on Code, copy the URL, and install it as git+ followed by that URL. That's what this command does. Now, to use Whisper we also need ffmpeg, a command-line utility for codecs and audio/video processing. So once more: RUN apt-get update, then apt-get install ffmpeg. That's the installation done.

Next, COPY the application code in, and then EXPOSE a port. FastAPI services usually run on port 8000 by default, so I'll say EXPOSE 8000. Finally, the command instruction that runs the API: CMD, in capitals like the other instructions, takes an executable, so I'll write the executable that would otherwise run in a terminal. The first part is uvicorn, the server that runs the app, then the module and app names: let's call the file fastapi_app, with the FastAPI application variable inside it called app, so the target is fastapi_app:app (I'll show you what I mean by this in a moment). Then you define a host, 0.0.0.0, and port 8000. And we are done with our Dockerfile.

So that's a minimal Dockerfile, top to bottom: use a base image, create a working directory, copy requirements.txt into it, define a few installation steps for git, the requirements, Whisper from the GitHub source, and ffmpeg, then expose port 8000 and define the executable command that runs the API service.
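A minimal Dockerfile assembled from the steps narrated above would look roughly like this; the /python-docker directory name and the fastapi_app:app target follow the narration, while the exact ordering of the RUN layers is a best-effort reconstruction:

```dockerfile
# Slim Python 3.10 base image keeps the final image small
FROM python:3.10-slim

# Working directory inside the image
WORKDIR /python-docker

# Copy and install the Python dependencies first (better layer caching)
COPY requirements.txt requirements.txt
RUN pip install -r requirements.txt

# git is needed to install Whisper straight from the GitHub source
RUN apt-get update && apt-get install -y git
RUN pip install git+https://github.com/openai/whisper.git

# ffmpeg does the audio decoding Whisper relies on
RUN apt-get update && apt-get install -y ffmpeg

# Copy the application code into the image
COPY . .

# FastAPI apps conventionally listen on 8000
EXPOSE 8000

# Executable form of CMD: run the app with uvicorn on all interfaces
CMD ["uvicorn", "fastapi_app:app", "--host", "0.0.0.0", "--port", "8000"]
```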
Now that the Dockerfile is written, I'm going to create a file called fastapi_app.py, and in here we'll write all of the FastAPI logic for the transcription API, so the end user can make a POST request with their audio file or files and get the respective transcriptions back from the service running inside the container.

Let's write the code. From fastapi I import FastAPI, and I also need File, UploadFile, and HTTPException. I also need List from typing, because we'll accept multiple audio files passed in a list. From fastapi.responses I import JSONResponse, which will carry our output, and RedirectResponse, so that if you hit the service on port 8000 it takes you to /docs, the Swagger UI. With the FastAPI pieces done, the next imports are whisper itself and torch, which we need to check for CUDA if you have a GPU machine. The last one is NamedTemporaryFile from tempfile. That's it for the imports.

Then app = FastAPI(): this is how you initiate an app in FastAPI, and it's exactly what we referenced in the Dockerfile; the file is fastapi_app and the variable is app, which is why the CMD target is fastapi_app:app. Next, if you're deploying this in production and have GPU support, it's better to pick the device with torch.cuda.is_available(). I'll have a constant called DEVICE (we should look at it as a constant, not a variable, since it's not going to change): torch.device set to cuda if CUDA is available, else cpu. Then we load the Whisper model, the base one; you can see the size of this base model, 74M parameters, which is very little if you compare it to the large language models we're seeing nowadays, but base works fine. And we bind it to the device we just defined. So we have loaded our model here.
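Pulling those imports and the model setup together, the top of fastapi_app.py comes out roughly like this (the base model choice and the cuda-or-cpu fallback follow the narration):

```python
from typing import List
from tempfile import NamedTemporaryFile

import torch
import whisper
from fastapi import FastAPI, File, UploadFile, HTTPException
from fastapi.responses import JSONResponse, RedirectResponse

# Treat the device as a constant: GPU when available, otherwise CPU
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the 74M-parameter base model and bind it to that device
model = whisper.load_model("base", device=DEVICE)

# "app" is what the Dockerfile's CMD target fastapi_app:app refers to
app = FastAPI()
```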
Next I'm going to write a couple of endpoints, the decorators that actually expose the API service. First, @app.post on a route called /whisper, and inside it an async function, the handler, which takes a list of UploadFile, because if you have multiple files that you want to upload and transcribe, this is how you should accept them.

GitHub Copilot is suggesting something here that I believe is mostly right, so let's work through it. First the validation: rather than checking that the length of files equals one, let's make it simpler, if not files, raise an HTTPException with status code 400 and a message like "No files were uploaded", so the end user gets a proper error when no files are attached. Then I create an empty list called results. For each file in files, I open a NamedTemporaryFile with delete=True as temp, open temp.name for writing as a temporary file buffer, and write file.file.read() into it, so the upload lands in a temporary file that Whisper can read from disk. Then we define a result variable using model.transcribe; transcribe is the function OpenAI gives you where you pass the file name, so result = model.transcribe(temp.name). Then I append to results in a structured way: the file name, and, let's not call it text, maybe transcript, where we take only the text out of the output, result["text"], since that's what Whisper returns. Finally we return a JSONResponse.
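As a sketch, the /whisper endpoint built up in those steps looks roughly like this; it assumes the imports and model from the previous snippet, and the temporary-file handling mirrors the narration:

```python
@app.post("/whisper")
async def handler(files: List[UploadFile] = File(...)):
    if not files:
        raise HTTPException(status_code=400, detail="No files were uploaded")

    results = []
    for file in files:
        # Stage the upload in a temporary file so Whisper can read it from disk
        with NamedTemporaryFile(delete=True) as temp:
            with open(temp.name, "wb") as temp_file:
                temp_file.write(file.file.read())

            # transcribe() takes a path and returns a dict with a "text" key
            result = model.transcribe(temp.name)

            results.append({
                "filename": file.filename,
                "transcript": result["text"],
            })

    return JSONResponse(content={"results": results})
```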
The JSONResponse content should be a key-value pair, so we pass results under a "results" key. (We will also have to set a content type in Postman; we'll see that.) So the handler is done. Next I'll write one more decorator, @app.get on the root, with a response_class of RedirectResponse, and under it an async def redirect_to_docs that returns "/docs". With that, our endpoints are done; the one the end user has to worry about is /whisper, where you upload your files. Also make sure your file name matches the one in the Dockerfile's CMD; that's how the app gets run inside the image.
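And the root redirect, as a minimal sketch:

```python
@app.get("/", response_class=RedirectResponse)
async def redirect_to_docs():
    # FastAPI wraps the returned URL in a RedirectResponse
    return "/docs"
```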
Now, how do you build the image? If you're on Windows and don't have Docker installed, go ahead and install the Docker Desktop app; it's available as an executable. On Linux it's basically a one-line command, something like sudo apt install docker; have a look at their documentation. You can do everything from the Desktop app itself, but I'll do it from the terminal. To check whether you already have any images, the command is docker images; when I run it you can see I already have an image called whisper-api-img (not running, by the way; the image is just there). To find out whether there's an actively running container, you use docker ps; for me it shows no container running, but I do have the image.

To build the image: if you're on a Linux distro and haven't added Docker to your user group, you have to use sudo docker build -t followed by the image name, whisper-api-img in my case. If the Dockerfile is in the same directory, you just add a dot at the end; if it's somewhere else, you give the relative path to it instead. Because I'm on Windows I don't need sudo, so it's just docker build -t whisper-api-img . (the exact commands are sketched just after this). Once you hit enter it starts creating layer after layer. First it pulls the base image if it isn't already cached; if you've worked with the Python base image before and haven't deleted it (docker rmi is how you delete an image), it will already be there. It pulls the Python image from the official Docker Hub repository, builds the layers, installs all the dependencies, and gives you a Docker image that you can run within a container powered by the Docker engine. That's how it works. I've already built it, so I'll just run the image I have; if you're doing it for the first time, expect to wait three to five minutes depending on your internet bandwidth.

To run it, there are two ways: run it as a daemon if you want it as a background process, or run it in the foreground so you can track the output. You define the port mapping, since it's going to run on 8000, then your image name, whisper-api-img in my case, and hit enter. It starts running the image within a container; what it does first is pull the Whisper base model weights from wherever they're hosted. Once it's built you can ship this container to a friend's system or deploy it: push it to Docker Hub, or to Azure Container Registry, Elastic Container Registry on AWS, or GCP's registry. And now you can see it says uvicorn running on localhost:8000.
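The terminal commands from this walkthrough, roughly; whisper-api-img follows the video, and -p 8000:8000 is the standard way to publish the container's port to the host (add sudo on Linux if your user isn't in the docker group):

```bash
# Check for existing images and running containers
docker images
docker ps

# Build the image from the Dockerfile in the current directory
docker build -t whisper-api-img .

# Run it in the foreground, mapping host port 8000 to the container
docker run -p 8000:8000 whisper-api-img

# Or run it detached as a background (daemon) process
docker run -d -p 8000:8000 whisper-api-img
```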
So now we have our API running on port 8000; let's go try it out. I'll open localhost:8000 in the browser, and you can see a Swagger UI. It has a couple of endpoints; the one we're interested in is the POST one, /whisper. Click on Try it out, and it asks you to add a string item for files, where you can upload a file. Let's first listen to the file I'm going to upload; it's a langchain.mp3 audio that starts with "Many of you have asked if LangChain...". Once I execute, it starts loading, and it will probably take 15 to 20 seconds to give you the result. And you can see I got my transcription here in a nicely beautified JSON: results, file name, transcript, and so on. It reads: "Many of you have asked if LangChain is a production-ready framework. In my honest opinion, LangChain is not yet production-ready, but they will be production-ready very soon, because if you go to the GitHub repository you will find a lot of PRs there." Something like that.

So you can see that: a one-minute audio, transcribed completely for free, running inside a Docker container, which makes it fast as well, to be honest. You can now take this Docker image and push it to a container registry, free or paid, depending on where you're going to deploy it. You can deploy it within an App Service on Azure, on a Container Instance, on AWS Fargate, or on GCP's Kubernetes Engine; it depends where you want to deploy it and what you want to do. The entire code will be available on the AI Anytime GitHub repository. You can use this for the transcription services you're working on, locally in your own infrastructure or deployed on the cloud, and you don't have to pay for the Whisper API. If latency is not a problem, you can build batch-processing use cases like contact-center or call-center modernization, where transcription is key: you listen to call-center calls and derive insights from them. This is a fine little application; there's one small thing that probably didn't work, and I'll figure out why and maybe update it in the video edits.

If you have any thoughts or feedback, please let me know in the comment box, and if you have any doubts, reach out to me through my social media channels; you can find them in my channel's About section and on the YouTube banner. If you haven't subscribed to the channel yet, please do subscribe, like the video, and share the video and the channel with your friends and peers. Thank you so much for watching; see you in the next one.
