Speaker 1: Hello and welcome. Today let's learn how to transcribe audio and more using AssemblyAI's Node SDK. We're going to be building this code in JavaScript, so let's get started. The first thing you want to do is install AssemblyAI's Node SDK. You can find how to do that in our Node SDK repository on GitHub, and I will make sure to leave links to the documentation and to the GitHub repository in the description so you have them. Here are the commands you can use to install AssemblyAI. I use the npm installer. It only takes a second, but I've already done it, so I'm not going to do it again. Next, we have to import AssemblyAI and set up our AssemblyAI API key. All we're doing is importing AssemblyAI and then using our API key to create an AssemblyAI client object. I already have my AssemblyAI API key as an environment variable. If you don't, you can simply copy and paste your AssemblyAI API key here. If you don't know where to find it, it's very easy: just go to assemblyai.com and make sure you have an account. Once you log in, you'll be able to copy your AssemblyAI API key right here, and once you've copied it, you can paste it in. But like I said, I already have it, so let's see what we have to do next. You can follow the instructions either in the GitHub repository or in the documentation. Speech recognition is what we want to do, and this time we're going to use an example audio URL. This is a talk show, I think, or a podcast about the Canadian wildfires. So we're just setting up our audio URL; I'll copy this and go through it here. The first thing we do is create a configuration object, and for now we're only passing it the audio URL. I'm going to show you how to change this configuration to achieve more than just transcription with the Node SDK. So here in the configuration we're creating an object and only passing an audio URL to it. Then I'm creating a function, an asynchronous function, and in it we're calling the transcripts endpoint through the client we created, the client being the AssemblyAI object, and passing the configuration to it. As a result, we get a transcript. The reason we have the await keyword here is that the transcription is going to take a little bit of time to be created. It's not going to happen instantaneously, which is why we wait for this line to finish, get the transcript, and then print the text from the JSON response that is returned to us. One thing to note here is that this client.transcripts.create function runs until the result comes back as error or completed. So if you want to make sure you capture the error case, you should of course add a little condition here to catch when it is returned as error. But here we're just going to go with the assumption that it will be completed after a while, and once it's completed we print the text. Let's run this first and see how it works, and then I will go into more detail. All right, we already got it. It took around five seconds, and we have the transcription. But you might be confused about what is being returned, because we are getting the text immediately from a transcript object. So let's also print the transcript to see what it normally looks like before we extract any information. All right.
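Here is a minimal sketch of the setup described so far, assuming the SDK is installed with `npm install assemblyai`, the API key is stored in an environment variable (the variable name below is a placeholder), and the audio URL is a stand-in for the wildfires podcast used in the video:

```js
// Sketch of the steps described above -- exact method names may differ between SDK versions.
import { AssemblyAI } from "assemblyai";

// Create the client with your API key (placeholder environment variable name).
const client = new AssemblyAI({ apiKey: process.env.ASSEMBLYAI_API_KEY });

// Configuration object: for now we only pass the audio URL (placeholder URL).
const config = {
  audio_url: "https://example.com/canadian-wildfires-podcast.mp3",
};

const run = async () => {
  // create() submits the audio and, by default, keeps polling until the status
  // is "completed" or "error", so awaiting it yields the finished transcript.
  const transcript = await client.transcripts.create(config);
  console.log(transcript.text);
};

run();
```

Note that newer releases of the SDK expose the same polling flow as `client.transcripts.transcribe(config)`; the video uses `create`, so that's what the sketch shows.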
So what we get back is a very long JSON response. It starts here. We first get an ID for this transcript, and we can use this ID to retrieve the transcript later, so you do not have to start a transcription from scratch; you can just fetch a previously created transcript. Then there's the status, which as we can see here is completed. There are a handful of values the status can take, so let's take a look at the documentation. As you can see in the documentation (maybe I'll make it a bit bigger), it can be processing, queued, completed, or error. The create function waits until the status comes back as completed or error; if it is still processing or queued, it simply makes the program wait until it is done. Then we can see our audio URL and the text, which is what we're printing here. You also get each word with its timestamp, so that is a very long section, and then there is some more information after that. A couple of notes: for the audio URL, you can also just give an audio file. So instead of a URL, if I had a local file here, that would work too. And it doesn't have to be audio; you can pass a video file or a link to a video, although uploading a file might take a bit longer. While we're waiting for the transcription to be done, what we're basically doing is polling: this function is constantly asking the AssemblyAI API whether the transcription is done or not. One nice thing you can do with the Node SDK is control how this polling behaves. First, you can set poll to true or false; it is true by default, so the SDK polls and waits for AssemblyAI to return the status as completed or error. You can also set a polling interval and a polling timeout, which tell it how often to check and when to stop waiting. These are millisecond values. Normally the polling interval is three seconds, but let's make it one second, for example. And the polling timeout is normally 180 seconds, I think, but let's make it 10 seconds and see the result of that. Actually, in 10 seconds our transcription might already be done, so I will make it three seconds. Yeah, as you can see, because we only gave the transcription three seconds to finish before the polling timeout, a polling timeout error was thrown. So this is how you can have more control. If you want, you can set polling to false, so let's see what happens then. As you can see, we are again printing transcript.text and then the transcript object itself. transcript.text is returned as null because of what this transcript object contains at this point. Here is the JSON object that is returned: again we have an ID, a language model, an acoustic model, and a language code (it is determined to be English), and the status is queued. It's not completed yet, it's just queued. We have the audio URL, but the text has not been created yet because nothing has been transcribed, and we also don't have the words. That's why transcript.text is null. So this is the object that is returned to you until the transcription is completed. But let's say this is what you wanted to do, and you want to get the transcript at a later time. Let's see how you can do that.
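A sketch of the polling options just described, reusing the client and config from the earlier sketch and continuing inside the same async function. The option names (poll, pollingInterval, pollingTimeout) are the ones mentioned in the video; passing them as a second argument to create() is an assumption about the SDK version being used:

```js
// Values are in milliseconds. A timeout shorter than the transcription time
// makes the SDK throw a polling timeout error, as shown in the video.
const transcript = await client.transcripts.create(config, {
  poll: true,            // default: keep asking the API until completed/error
  pollingInterval: 1000, // check once per second instead of every 3 seconds
  pollingTimeout: 3000,  // stop waiting after 3 seconds (throws if not done)
});

// With poll: false, create() returns immediately with status "queued",
// so transcript.text is still null at this point.
const queued = await client.transcripts.create(config, { poll: false });
console.log(queued.status); // "queued"
console.log(queued.text);   // null
```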
So this would basically be doing the polling yourself. I can say: while transcript.status is not completed (sorry, not ID, status), get the transcript again. So I'll write transcript equals await client.transcripts.get, and what I'm passing in is transcript.id. I'm using the ID of this transcript, like I mentioned before, to fetch the transcript and check its status. Then I'll print the transcript status, just so we're aware of it, and then sleep for a little bit; one second should be enough. Here's a little sleep function I wrote earlier, which basically makes us wait for one second. So what we're doing is first submitting a transcript request without polling, so we are not waiting for it to be completed, which means the transcript status will not be completed at first. Then we keep asking AssemblyAI how the transcript is going based on this ID and check the status. If the status is still not completed, we run this again, and once it's done, we print the text. I think this should work; let's check. Oh, yes, this is a constant variable, let me fix that. As you can see, we now get processing, processing, processing, and at some point it turns to completed. Nice. It is completed, and we have the text. Of course, this is built-in functionality, so you do not have to do it yourself. But let's say you have a more complex application and you need to start the AssemblyAI transcription request and retrieve it later, or you just want to recall what a transcript you created earlier said. Then you can use the client.transcripts.get function. All right, let's go back to the defaults. I don't need this anymore, and I can delete the sleep function. There are many other things you can do with AssemblyAI. If we take a look at the documentation, you'll see that apart from the core transcription, or speech recognition, you can do speaker diarization, that is, determining which speaker is saying which sentences. You can do summarization. You can do content moderation, which flags parts of the audio where something sensitive is being talked about. There's also sentiment analysis, entity detection, topic detection, auto chapters (dividing the audio automatically into separate chapters), key phrases, and personally identifiable information redaction. I'll show you how to do speaker diarization and summarization because I think they're quite interesting. All you have to do is go to your config, or configuration, and set speaker_labels to true. That's all. If we take a look at an old JSON response we got earlier, you can see all the different models AssemblyAI offers, for example redacting personally identifiable information, speaker labels, and content safety, and they're all set to false. All you're doing is setting that one to true, and then you get the speaker label results on top of the transcription. So let's run the transcription and print the full transcript response again; it will be interesting to see what it looks like. Again, let's go all the way to the top. We have the words here and the text. Okay, this is the top: ID, text, audio URL, the words and their timestamps, we got it all. And then once you come down here, we see something called utterances.
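A reconstruction of the manual polling loop described above, assuming the request was submitted with poll: false. The sleep helper stands in for the small utility mentioned in the video, and a real application would also want to check for an "error" status:

```js
// Small helper that resolves after the given number of milliseconds.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Submit the request without polling; note `let`, since we reassign it below
// (this is the "constant variable" slip from the video).
let transcript = await client.transcripts.create(config, { poll: false });

// Keep fetching the transcript by ID until AssemblyAI reports it as completed.
while (transcript.status !== "completed") {
  transcript = await client.transcripts.get(transcript.id);
  console.log(transcript.status); // "queued", then "processing", then "completed"
  await sleep(1000);
}

console.log(transcript.text);
```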
And this is basically every part of the audio that was uttered by a different speaker. For each one you get the start timestamp, the end timestamp, which speaker it belongs to, and the transcription of what they said. Let's print this nicely, so I'll create a for loop; I don't really need this anymore. For each utterance of the utterances, I want to print the speaker, which we get from utterance.speaker, and then utterance.text. All right, this should work; let's run it again and see. All right, this is our result. Speaker A says something, then speaker B says something. Speaker A: "So what is it in the haze that makes it harmful?" So this seems to be an interview about the Canadian wildfires. Nice, so we can separate the different speakers like this. Let's say you also want summarization. All you have to do, again very simply, is set summarization; let's take a look at the documentation. Summarization: you just need to set summarization to true. But if you have eagle eyes, you might have noticed that there are some other things you can set. These are optional, not mandatory, so as long as you set summarization to true, you will get informative bullets as a response. But let's see what the other options are. You can set the summary type and the summary model based on your preferences. Summary type is what kind of summary you want returned: the options are bullets, which is the default, verbose bullets (a bit longer), a gist (just a few words), a headline (a bit longer, kind of like a punchy one-sentence summary), or a paragraph. Like I said, the default is bullets. Then you can also specify the summary model. There are a few. The informative one is, how to say, a bit more distant; the goal is to inform the reader. Conversational uses a bit more relaxed language. And catchy is useful if you want to create, say, a podcast title, or a line to put on a video thumbnail. Like I said, the default is informative. So let's set those two. Summary type: maybe I'd like to get a headline this time. And summary model, let's look at the options again: let's say a catchy headline is what we want. Once we've done that, I will actually stop printing the results for the speaker labels, just so we can see the result of the summarization a bit better, and all I have to print is transcript.summary. All right, this should work. This is actually quite good; we basically got a headline for what this podcast could be, and it is "smog advisory issued across the US." So what if we want something a bit longer, what if we actually want to use it as a summary? Maybe I'll change it to paragraph, and then informative. All right, we got it: "Smoke from hundreds of wildfires in Canada is triggering air quality alerts through the US. Peter DeCarlo is an associate professor in the Department of Environmental Health and Engineering at Johns Hopkins University." Nice, so we got a nice little informative paragraph about what this audio is about. One thing to note here: if you set summary type, you have to set summary model, and vice versa.
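A sketch pulling together the speaker label and summarization settings described above; the parameter names (speaker_labels, summarization, summary_type, summary_model) are the ones mentioned in the video, while the audio URL and the surrounding boilerplate are placeholders:

```js
const config = {
  audio_url: "https://example.com/canadian-wildfires-podcast.mp3", // placeholder
  speaker_labels: true,         // enable speaker diarization
  summarization: true,          // enable summarization
  summary_type: "paragraph",    // bullets (default), bullets_verbose, gist, headline, paragraph
  summary_model: "informative", // informative (default), conversational, catchy
};

const transcript = await client.transcripts.create(config);

// One entry per stretch of speech by a single speaker.
for (const utterance of transcript.utterances) {
  console.log(`Speaker ${utterance.speaker}: ${utterance.text}`);
}

// The generated summary, in the requested type and model.
console.log(transcript.summary);
```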
So you cannot set just one of them; you have to make sure to define both of them in your code. And this is how you can transcribe your audio, start analyzing it, and maybe get insights from it with AssemblyAI. You can take a look at the documentation, to which I've left a link in the description, to see all the different models you can use with AssemblyAI. You can also use LLMs on your audio or videos directly through AssemblyAI, but maybe we'll go into that in more detail in a different video. This Node SDK can be used with JavaScript and, of course, also with TypeScript. Make sure you sign up to AssemblyAI to try all of these different models and transcription on your audio or video files. I hope this video was helpful. If you have any questions, don't forget to leave them in the comment section below. Thanks for watching, and I will see you in the next video.