Create a conversational app with IBM's AI tools
Learn to build an app that uses IBM Watson to record speech, transcribe it with Speech to Text, query Watson Assistant, and speak the reply with Text to Speech.
Voice-Interactive App with IBM Watson: Integrating Speech-to-Text, Watson Assistant, and Text-to-Speech

Speaker 1: Welcome back. In this particular video, we will use three IBM tools, Speech to Text, Watson Assistant, and Text to Speech, to create an application that records audio from the microphone, transcribes the audio using IBM Watson Speech to Text, sends the transcribed text to Watson Assistant and receives a response, converts the assistant's response to speech using IBM Watson Text to Speech, and finally plays the assistant's verbal response. So let's get started. To start, we'll need all three of these resources. If you do not have an IBM Cloud account, I recommend you create one. Then, inside the catalog, search for Watson Assistant first. Pick a location, something close to me, confirm that you have read the license agreement, and create the Watson Assistant. Once the assistant is created, I'm going to say launch watsonx Assistant and create a first assistant. I'll name it "meaning of life" and skip the description, then go next. For the setup questions, I'll say web; for "tell us about yourself", banking and financial, lead strategist; and for the goal, "I want to provide confident answers to common questions", so I'll just use the first defaults. Next. I'm not going to use any of the customization options, so I'll keep the defaults and say create. Once it's created, I'm going to create an action: a simple one where the user says "What is the meaning of life?" and the assistant replies, "The meaning of life is 42." That's it, that's the end of the action; I save it and close it. We have our assistant created. The next step is to create the Speech to Text and Text to Speech resources. I go back to my catalog and create Speech to Text, keeping it in the same region. Create. It shows me a getting-started page, but I'll go directly to service credentials, and that's all I need from here. Next, I create the Text to Speech resource, again in the same Dallas location. Create. Once all my resources are available, I go back to the resource list and open each one separately in a new tab. I think I already have Watson Assistant open, so I need Speech to Text and Text to Speech. With everything open, I take down the Text to Speech API key and service URL and save them aside, then do the same for Speech to Text: copy the key, label it speech-to-text, and copy the URL. Next, I go to Watson Assistant, open assistant settings, and find the assistant ID and API details. Since I'm running a draft environment, I copy the draft environment ID, and for the API key and service URL I can open the service instance in a new tab and copy them from the same credentials page. So I have everything that I need; let's go ahead and build this. The first step is speech to text, so let's create a folder, I'll call it code, and open it in Sublime Text.
I'll keep the credentials handy in this tab and create a new file. First we'll just do speech to text, the first box, where I record some audio and convert that audio into text. For that I'll use a couple of libraries: PyAudio for capturing audio input from the microphone, and wave for saving the audio to a file before sending it to Watson's STT service. Let's call the file stt.py. I import pyaudio, import wave, and import json for the API responses. Then I need the IBM Watson library: ibm_watson is the library used for connecting to IBM Watson services, and from it I import SpeechToTextV1, the class I'll use. The next one is for authenticating my API calls: the IAMAuthenticator in IBM's Cloud SDK is used to authenticate API requests by providing an API key for identity, so from ibm_cloud_sdk_core.authenticators I import IAMAuthenticator. Those are the libraries I want. Also, before I start, let me open a terminal, cd Desktop, cd code to go inside that folder, and create a virtual environment: python -m venv testIBM, then source testIBM/bin/activate. Clear that out, and install the libraries: pip install ibm-watson pyaudio (wave ships with Python). Once that's installed, I'm good to go. So now, let's finish this application. First, let's create variables for the credentials: API_KEY, and you'll have to use your own API key here, mine will not work for you; this is the Speech to Text one, so that's the key I use. Then a variable SERVICE_URL, equal to the copied Speech to Text URL. Perfect, I have my API key and service URL added here. Next, let's initialize the Speech to Text service. For that I create a variable called authenticator, equal to IAMAuthenticator with the API key passed to it. Next I create a speech_to_text variable, which uses SpeechToTextV1 and needs authenticator=authenticator as its parameter. The last thing is speech_to_text.set_service_url, and into that I pass the service URL. So this initializes my Speech to Text service.
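As a minimal sketch, that setup looks like this (the key and URL are placeholders; substitute your own service credentials):

```python
import pyaudio
import wave
from ibm_watson import SpeechToTextV1
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

API_KEY = "your-stt-api-key"          # placeholder -- use your own key
SERVICE_URL = "your-stt-service-url"  # placeholder -- from service credentials

# Initialize the Speech to Text service
authenticator = IAMAuthenticator(API_KEY)
speech_to_text = SpeechToTextV1(authenticator=authenticator)
speech_to_text.set_service_url(SERVICE_URL)
```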
Next, let's initialize PyAudio. I create a variable called audio, equal to pyaudio.PyAudio(); that's the call that initializes it. Then the recording parameters. I create a FORMAT parameter, equal to pyaudio.paInt16, so a 16-bit audio format. The other things I might need: CHANNELS, the number of channels you want; I want mono, which is one single channel. Then RATE, the sample rate of the audio; I'll say 16 kilohertz for speech to text, which is 16000. Next is CHUNK, the audio chunk size: we read the audio in pieces of this size rather than all at once. And how long do I want to record? For that I'll set RECORD_SECONDS to five; "what is the meaning of life?" doesn't take more than five seconds to say. Next is WAVE_OUTPUT_FILENAME, where the output audio goes: let's call the file output.wav. Now I print "Recording..." to the terminal, and to capture the stream I create a variable and say audio.open. That open call needs the format, which is the FORMAT parameter here; channels, the CHANNELS parameter we created; rate, which is the RATE we created; input=True, saying we want input; and finally frames_per_buffer, which takes the CHUNK size we created, how much to chunk it into. Then frames, which I set to an empty list right now; I'll fill it up as I capture the audio. Let me move this up so it's easier for you to see. Here I capture the audio data: for i in range(0, int(RATE / CHUNK * RECORD_SECONDS)), and let me explain that in a second: one loop iteration per chunk, for the whole recording time. For each iteration I read a chunk, data = stream.read(CHUNK), and add it to my frames with frames.append(data). So this builds my frames array with all the audio that is recorded, divided into chunks of that 1024-frame size. Once that is done, let's make sure we say the recording is finished: I print "Recording finished", then stream.stop_stream(), the function that stops the stream; then stream.close() to close it; and finally audio.terminate() on our PyAudio instance. That does our cleanup for stopping the recording. Now, let's save the recorded audio to a file. I say wf = wave.open, the other library we imported, passing WAVE_OUTPUT_FILENAME and "wb" for write permissions. Then wf.setnchannels(CHANNELS) sets the channels; next I set the sample width, wf.setsampwidth(audio.get_sample_size(FORMAT)), which needs the format; then wf.setframerate(RATE) for the frame rate; then the write-frames function, wf.writeframes(b"".join(frames)); and finally wf.close(). So I have my wave file created here: the recorded audio saved to a file.
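A sketch of that recording and saving step, using the parameter values just described:

```python
# Audio recording parameters
FORMAT = pyaudio.paInt16       # 16-bit samples
CHANNELS = 1                   # mono
RATE = 16000                   # 16 kHz sample rate for speech
CHUNK = 1024                   # frames per buffer
RECORD_SECONDS = 5
WAVE_OUTPUT_FILENAME = "output.wav"

audio = pyaudio.PyAudio()

print("Recording...")
stream = audio.open(format=FORMAT, channels=CHANNELS, rate=RATE,
                    input=True, frames_per_buffer=CHUNK)

frames = []
for _ in range(0, int(RATE / CHUNK * RECORD_SECONDS)):
    frames.append(stream.read(CHUNK))   # one CHUNK-sized buffer per iteration

print("Recording finished.")
stream.stop_stream()
stream.close()

# Save the recorded audio to a .wav file
wf = wave.open(WAVE_OUTPUT_FILENAME, "wb")
wf.setnchannels(CHANNELS)
wf.setsampwidth(audio.get_sample_size(FORMAT))
wf.setframerate(RATE)
wf.writeframes(b"".join(frames))
wf.close()

audio.terminate()   # release PyAudio now that we're done recording
```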
Now, whatever we have, we open that saved audio file and transcribe it using Speech to Text. For that I use a with statement: with open(WAVE_OUTPUT_FILENAME, "rb") as audio_file. Inside it I say response = speech_to_text.recognize; recognize is the function, and I pass it the parameters. The first one it needs is audio, which is our audio_file, the one we just opened. The next parameter is what type of content it is, content_type; in this case, since we are passing a wave file, I'll say "audio/wav". The next thing it needs is the model, so in this particular case I'll say "en-US_BroadbandModel"; you can change the model if needed for other languages, but this is the common broadband model used for US English. Then I say .get_result(), and that will have my data. Once I have it, I extract the transcription from it: inside the response there is an array of results, so I look for that array, take the first item out of it, and inside that there is an array of alternatives, which is what I'm looking for; the first item from there has my transcript. So, transcript = response["results"][0]["alternatives"][0]["transcript"]. Finally, I print "Transcription:", transcript, just to see if it did right by us. That is the entire code for our speech to text: we used the APIs and did everything properly. I hope there are no errors, but the best way to find out is to run it. So I say python stt.py. "Hello, welcome, 1, 2, 3." We got an error; let's see. Oh, it's setnchannels, how many channels to use; I had misspelled it. Let's try again. "What is the meaning of life?" Another mistake, a typo in setsampwidth. Clear, run it again. "What is the meaning of life?" This time it worked. It says recording finished, then sends the audio to IBM and gets the data back: "what is the meaning of life". There you go, it transcribed properly. I can run it again: "Testing 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15." It recorded for five seconds while I kept counting, so it only caught up to about 12 and then stopped; that's how much I could say in five seconds. And that completes our first step, speech to text.
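That transcription step, as a sketch under the same assumptions:

```python
# Open the saved audio file and send it to Watson Speech to Text
with open(WAVE_OUTPUT_FILENAME, "rb") as audio_file:
    response = speech_to_text.recognize(
        audio=audio_file,
        content_type="audio/wav",
        model="en-US_BroadbandModel",   # US English broadband model
    ).get_result()

# The first alternative of the first result carries the transcript
transcript = response["results"][0]["alternatives"][0]["transcript"]
print("Transcription:", transcript)
```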
The next thing we'll do is get the data from Watson Assistant. This is a much simpler one. I create another file and call it assistant.py. Inside it I need some of the same imports, so I'll just copy them from the other file: not SpeechToTextV1 this time, but AssistantV2 from ibm_watson, plus the IAMAuthenticator from ibm_cloud_sdk_core. Now let's authenticate. I say authenticator = IAMAuthenticator and pass it the assistant key, which I have right here, inside quotes of course. The next thing I need is assistant = AssistantV2; I have to specify the version I'm using, and the current version right now is 2021-06-14; you can check it in the documentation. It also needs authenticator=authenticator, similar to the previous STT app we wrote. Next, assistant.set_service_url, and in this particular case I get the service URL for the assistant and pass it here. Perfect. Then I get the response directly: since I'm just passing text, I don't have to record anything; I'll merge everything together at the end, but for now let's call the function message_stateless. The first parameter it needs is assistant_id, which we got when we created our assistant; that's the draft environment ID right here, so I copy it and paste it in. The next thing it needs is the input, a question. I say the message_type is of type "text", and then it needs the message in the text field, so I say the text is "What is the meaning of life?"; that was my question for the assistant. And once I have that, I have to say .get_result(). Perfect, that gets my result in the response. Now, where in this particular response is the reply I get? Again, from the documentation, it's in output, then generic, at position zero, the first item, and inside that I'll have the text. So let's print "Meaning of life:" and whatever it replies back. Let's check it out: I run python assistant.py, and: the meaning of life is 42. This is what we had entered in our IBM watsonx Assistant. If you launch the Watson Assistant and look at the actions, the question was "What is the meaning of life?" and the answer was "The meaning of life is 42", and that's what the assistant replied back, which is what we see right here.
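Here's roughly what assistant.py amounts to; a sketch, with placeholder credentials:

```python
from ibm_watson import AssistantV2
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

WA_API_KEY = "your-assistant-api-key"           # placeholder
WA_SERVICE_URL = "your-assistant-service-url"   # placeholder
WA_ASSISTANT_ID = "your-draft-environment-id"   # placeholder

authenticator = IAMAuthenticator(WA_API_KEY)
assistant = AssistantV2(version="2021-06-14", authenticator=authenticator)
assistant.set_service_url(WA_SERVICE_URL)

# Send one stateless message and read the reply out of output.generic
response = assistant.message_stateless(
    assistant_id=WA_ASSISTANT_ID,
    input={"message_type": "text", "text": "What is the meaning of life?"},
).get_result()

print("Meaning of life:", response["output"]["generic"][0]["text"])
```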
So now that leaves us with only one last thing, which is text to speech. Text to speech is much simpler again, because we don't have to record anything, we just have to play the sound. For now I'll write the result into an audio file; later I'll show you how to play it through your speakers. So, same stuff: I'm probably better off copying the imports over, and this time I use TextToSpeechV1 and the IAMAuthenticator. I'll call the file tts.py, and in here I use almost the same code. I say authenticator = IAMAuthenticator and pass it the matching API key, which is the Text to Speech one: copy that, paste it. Next I need the URL; let me make sure it's the right one first, because the next line is text_to_speech = TextToSpeechV1, and this just needs authenticator=authenticator. Then finally text_to_speech.set_service_url with that URL. So that's set. Now I don't need the recording part; I say with open, opening a file called output-audio.wav with "wb" permissions as audio_file. In here I get response = text_to_speech.synthesize; that's the function I need to call, and I pass it the parameters: the text to speak, accept="audio/wav", since that's the file type we want back, and the voice I want to pick. There are a bunch of different voice options; I'll pick a voice called Allison, "en-US_AllisonV3Voice". There's also an Allison Expressive voice, which sounds nicer; I'm not sure if it's available through the API, let's see, but I have used it in the interface. Then I say .get_result(), and audio_file.write(response.content) to write whatever response I get. So that's about it; let's go ahead and run the application. I say python tts.py. It complains about reply: since we're testing text to speech on its own, we're not getting a reply from the assistant yet, so I need to pass some hard-coded text there instead. Let's run that again, go back to our code folder, and open the output-audio.wav file that was just created: "Hello, how are you?" So there you go, that's what was synthesized.
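A sketch of tts.py under the same assumptions (the hard-coded test text stands in for the assistant reply for now):

```python
from ibm_watson import TextToSpeechV1
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

TTS_API_KEY = "your-tts-api-key"          # placeholder
TTS_SERVICE_URL = "your-tts-service-url"  # placeholder

authenticator = IAMAuthenticator(TTS_API_KEY)
text_to_speech = TextToSpeechV1(authenticator=authenticator)
text_to_speech.set_service_url(TTS_SERVICE_URL)

# Synthesize some test text and write the returned audio to a file
with open("output-audio.wav", "wb") as audio_file:
    response = text_to_speech.synthesize(
        "Hello, how are you?",
        accept="audio/wav",
        voice="en-US_AllisonV3Voice",
    ).get_result()
    audio_file.write(response.content)
```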
So this got us to the point where we recorded speech audio and got text back, we gave Watson Assistant some text and got a reply, and we gave text to speech some text and got speech back. Now, to combine all of them together, I'm going to create a new file, complete.py; that's the file where I put all the code together into one. I'll rewrite it just so it's easier for us to understand all the steps in one place, copying and pasting one thing at a time. I need the libraries pyaudio and wave. I also want to play the sound this time, so I import simpleaudio as sa; make sure that package is available (pip install simpleaudio). I'm not actually using json anywhere, so I can get rid of that import. I need SpeechToTextV1, AssistantV2, and TextToSpeechV1, plus the IAMAuthenticator, so I make sure those are all imported. The first step is to get the API keys in here. This was the speech to text one, so I take that key and call it STT_API_KEY, and its URL STT_SERVICE_URL, so I have both of them available. Next I need the assistant, so I copy its key (make sure you copy the entire thing) and call it WA_API_KEY; WA for Watson Assistant, that's what I mean here. I also need the assistant ID, WA_ASSISTANT_ID, which I take from where I used it earlier, and the URL, WA_SERVICE_URL. And finally TTS, which is text to speech: TTS_API_KEY (again, make sure you have the entire thing or it will give you an error) and TTS_SERVICE_URL. Perfect, I have all three sets of credentials here. Now let's initialize the services. First speech to text, which was here before, so I could copy the entire thing as-is, but I want to do one thing at a time: I pass the STT key and make sure I name things properly, stt_authenticator, then speech_to_text = SpeechToTextV1 with that authenticator, and speech_to_text.set_service_url. Perfect, that's all set. Next we need the authentication for Watson Assistant, so I bring that over: wa_authenticator = IAMAuthenticator with the WA key, assistant = AssistantV2 with the version and authenticator=wa_authenticator, and set_service_url with WA_SERVICE_URL. Next, the same thing for text to speech: I copy it over, call it tts_authenticator, text_to_speech = TextToSpeechV1, and set_service_url with TTS_SERVICE_URL. So I have all three of them initialized; that's good. Now the audio recording parameters: I take those and move them all the way up, and I grab most of the recording code from before too; let's see what we need and what we don't. I have my format set, and I'm initializing PyAudio. Here, instead of running everything inline, since I want to run one thing at a time, I'm going to create a function that records audio from the microphone; I'll call it record_audio. It's much simpler and reusable in future. The stream goes under it, the frames, the audio capture, perfect; then the stream close, then the wave-file saving with wf. All of that is good, but I don't want the transcription part inside this function, because I'll pass those values around from somewhere else; I'll keep it at the bottom for now. Step two is using IBM Speech to Text to transcribe the audio, so I create another function, transcribe_audio, and this is where I use that recognize code, put into its own function. One thing at a time; it returns my transcript, so whatever it transcribed, it gives me that back. The next step is to send the transcribed text to Watson Assistant. For that I create another function, I'll call it get_assistant_response, and to this I pass transcript, which I got from the function above; that's why it's good to have it in this format. For the body I copy the response = assistant.message_stateless call, and here, instead of the hard-coded value, I pass it WA_ASSISTANT_ID. Perfect. The input now uses the transcript rather than a fixed question. Then I say .get_result(). I call the extracted text assistant_reply; I can print it here, and I return the assistant reply to wherever it was called from. So that takes care of sending the transcribed text to Watson Assistant and getting a reply from the assistant. The next thing is to convert the assistant response to speech using text to speech. I create convert_text_to_speech, and for this I need the assistant reply. Over here I copy the with open block; we should change the file name to something else, assistant_response.wav, just so it doesn't replace the earlier file and has a new name. Then response = text_to_speech.synthesize, and in here, instead of passing the hard-coded text, I pass the assistant reply. Let me make sure everything is fine: did I use the transcript? No, I did not, so let's make sure the assistant function takes the transcript as its input, while this one uses the assistant reply right here. Then accept="audio/wav", the voice is the same one, that's good, .get_result(), and we write the response content to the audio file. Finally, I call a function called play_audio, which we have not written yet, and pass it this file name. Perfect. Now, step five, let's write that play_audio function, and it needs a file name. In here I say wave_obj = sa.WaveObject.from_wave_file(filename); if you remember, sa is the simpleaudio we imported, and from_wave_file is what the function is called, and to it I pass the file name. Then I create another variable, play_obj = wave_obj.play(); this is how you play the sound. And then I say play_obj.wait_done(), so it waits until the playback is done. Once it's done you could finish things up, but I'm just going to leave it there. And then, finally, I will create a main function; this is the main function to run the entire process, and that's why I created everything as functions.
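Assembled, the combined file looks roughly like this; a condensed sketch, with placeholder credentials and the same service calls as in the earlier snippets:

```python
import pyaudio
import wave
import simpleaudio as sa
from ibm_watson import SpeechToTextV1, AssistantV2, TextToSpeechV1
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

# Placeholder credentials -- substitute your own
STT_API_KEY, STT_SERVICE_URL = "stt-key", "stt-url"
WA_API_KEY, WA_SERVICE_URL = "wa-key", "wa-url"
WA_ASSISTANT_ID = "draft-environment-id"
TTS_API_KEY, TTS_SERVICE_URL = "tts-key", "tts-url"

# Initialize the three services
speech_to_text = SpeechToTextV1(authenticator=IAMAuthenticator(STT_API_KEY))
speech_to_text.set_service_url(STT_SERVICE_URL)
assistant = AssistantV2(version="2021-06-14",
                        authenticator=IAMAuthenticator(WA_API_KEY))
assistant.set_service_url(WA_SERVICE_URL)
text_to_speech = TextToSpeechV1(authenticator=IAMAuthenticator(TTS_API_KEY))
text_to_speech.set_service_url(TTS_SERVICE_URL)

# Audio recording parameters
FORMAT, CHANNELS, RATE, CHUNK = pyaudio.paInt16, 1, 16000, 1024
RECORD_SECONDS = 5
WAVE_OUTPUT_FILENAME = "output.wav"

audio = pyaudio.PyAudio()

def record_audio():
    """Record RECORD_SECONDS of microphone audio into WAVE_OUTPUT_FILENAME."""
    print("Recording...")
    stream = audio.open(format=FORMAT, channels=CHANNELS, rate=RATE,
                        input=True, frames_per_buffer=CHUNK)
    frames = [stream.read(CHUNK)
              for _ in range(int(RATE / CHUNK * RECORD_SECONDS))]
    print("Recording finished.")
    stream.stop_stream()
    stream.close()
    wf = wave.open(WAVE_OUTPUT_FILENAME, "wb")
    wf.setnchannels(CHANNELS)
    wf.setsampwidth(audio.get_sample_size(FORMAT))
    wf.setframerate(RATE)
    wf.writeframes(b"".join(frames))
    wf.close()

def transcribe_audio():
    """Send the recorded file to Speech to Text and return the transcript."""
    with open(WAVE_OUTPUT_FILENAME, "rb") as audio_file:
        response = speech_to_text.recognize(
            audio=audio_file, content_type="audio/wav",
            model="en-US_BroadbandModel").get_result()
    return response["results"][0]["alternatives"][0]["transcript"]

def get_assistant_response(transcript):
    """Send the transcript to Watson Assistant and return its text reply."""
    response = assistant.message_stateless(
        assistant_id=WA_ASSISTANT_ID,
        input={"message_type": "text", "text": transcript}).get_result()
    assistant_reply = response["output"]["generic"][0]["text"]
    print("Assistant:", assistant_reply)
    return assistant_reply

def convert_text_to_speech(assistant_reply):
    """Synthesize the reply into a .wav file, then play it."""
    with open("assistant_response.wav", "wb") as audio_file:
        response = text_to_speech.synthesize(
            assistant_reply, accept="audio/wav",
            voice="en-US_AllisonV3Voice").get_result()
        audio_file.write(response.content)   # write inside the with block
    play_audio("assistant_response.wav")

def play_audio(filename):
    """Play a .wav file through the speakers with simpleaudio."""
    wave_obj = sa.WaveObject.from_wave_file(filename)
    play_obj = wave_obj.play()
    play_obj.wait_done()
```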
So here, first: what do I want to do? I want to record the audio, and the function we created for that is record_audio, so I call it: go ahead and record the audio. Next, I want the transcript from that recorded audio; which function was that? transcribe_audio, so I call that function. Perfect, that returns the transcript. Next, I want the assistant reply, which comes from get_assistant_response, the function we wrote right here, and it needs the transcript, so I pass it the transcript. And the next step is to convert; what was that function? convert_text_to_speech; I pass it the assistant reply, and that will also play the audio. So everything is set here. Now, to run the main process in Python, you write if __name__ == "__main__": and inside that I call the main function. At the end I'll also do the cleanup, audio.terminate(), which releases the PyAudio resources. Perfect, I have everything set up; let's see if it works from end to end. Clear the terminal, and I say python complete.py. And it has a problem: right, I still have to pip install simpleaudio. Clear; now that's available, let's go ahead and run it. But it looks like it's taking a transcription from somewhere, from the previous run: we have record_audio here, but main is not calling it. Oh, right here; makes sense, it wasn't being called. Let's go ahead and try again. "What is the meaning of life?" So it recorded it, transcribed "what is the meaning of life", it's going to convert now, and I have an error here. This error probably means I'm using some wrong credentials, and it's failing at the assistant level. So I clear it out, go back, and make sure my assistant API key, which is this one, is right; okay, that's good. Next is the URL, and it looks a little different from what it should be; maybe I copy-pasted the previous one. That's probably the problem, and fixing it should do it. Let's go back and try again. "What is the meaning of life?" And it worked; it got me "the meaning of life is 42", but then it failed with: did you mean "test"? Oh, I think I spelled it wrong, "test to speech" instead of text to speech; makes sense. Text_to_speech it is. Perfect, that should work fine now. Let's run it again. "What is the meaning of life?" We got the answer while I was still talking, so I'm not sure it caught everything, but it gave me the answer; now there's a problem, though: the object has no synthesize attribute. Speech to text doesn't have synthesize; that belongs to text to speech, so this service name is wrong. I forgot to change it inside convert_text_to_speech. Perfect, let's run it one more time. "What is the meaning of life?" Now it says write to closed file, which is in my convert_text_to_speech: the file was already closed when I wrote to it. The write needs to be inside the with block, not outside, because the with statement closes the file when it exits. So let's run it again; hopefully this is the last one. "What is the meaning of life?" It sends it for transcription, then gets my reply for "what is the meaning of life" from the assistant.
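After those fixes, the main function and entry point boil down to this short sketch:

```python
def main():
    # Run the whole pipeline: record, transcribe, ask, speak
    record_audio()
    transcript = transcribe_audio()
    assistant_reply = get_assistant_response(transcript)
    convert_text_to_speech(assistant_reply)

if __name__ == "__main__":
    main()
    audio.terminate()   # release PyAudio resources at the end
```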
And that completes the entire flow; I hadn't closed things properly after writing the file, which is why it gave me that earlier error, but now it runs from end to end. I can run it again. "What is the meaning of life?" "The meaning of life is 42." Perfect, it worked. That is what we wanted, and we went through the entire cycle: we recorded the voice, we transcribed it from speech to text, then the watsonx Assistant got that text and gave me the right answer, which is "the meaning of life is 42". From that, I took the text I got back from Watson Assistant and converted it into speech using the Text to Speech service from IBM, and then I played it on my laptop, and that is what you heard. So, if you have any questions, or if you had any issues following along, let me know and I will get back to you. Thank you.
