Exploring Azure's Speech Cognitive Services
Dive into Azure's Speech Cognitive Services, covering speech-to-text, text-to-speech, and custom speech models, enhancing app development.
Azure Speech Service Azure Cognitive Services Microsoft Azure Text to Speech Speech to text
Added on 01/29/2025

Speaker 1: Hi everyone, and welcome to Innovation Tech Community. Today we are discussing one of the cognitive services in Azure: the Speech service. Before we start the session, I would like to introduce myself. I'm Pratap Reddy Chenchappagari. I work as an Associate Consultant at AVDN Netos, I have completed six certifications, and I have won a number of national awards. If you have any questions about the latest technologies, career guidance, or anything else, please feel free to reach out to me on LinkedIn. I will share my LinkedIn account in the chat and in the description on the YouTube channel. We discussed Face and Vision in the previous video, and today we are discussing Speech. The subcategories of Speech in Cognitive Services are speech-to-text, speaker recognition, text-to-speech, and speech translation, and we will go through all of these services. In speech recognition there are three stages that take the audio through ingest, transform, transcribe, enrich, and serve. For example, suppose we have call recordings. The call recordings are stored in a call recordings blob in Blob Storage and then posted to a function app. They can also be used to train a custom speech model. The audio is then transcribed either by the Speech service batch API or by the custom speech model, and a function fetches the transcription. Okay, one second.
The batch API writes a call transcript blob, the result is queued, and it is then enriched and served using the Language service. The call transcript blob is passed to the Language service, which analyzes the call recording, for example to work out which language it is in, and writes a transcription insights blob. If we need these insights in a visual form, we can create Power BI visualizations or a web application. Okay, one second. That is the basic idea of the Speech service in Cognitive Services. As we all know, the Cognitive Services APIs are speech, language, knowledge, search, vision, and machine learning. Okay, let's move on to the Azure portal. We all know the Azure portal domain: portal.azure.com is the official portal for Microsoft Azure.
Once the page appears, you need to log in with your credentials; I will log in with my free trial credentials. One second. Once you have logged in, the portal appears like this, and you search for Azure Cognitive Services. You can see Azure Cognitive Services here. When we open it, there are different types of cognitive services: Azure OpenAI, Speech, Language, Vision, Decision, and multi-purpose. These are the cognitive services that Microsoft Azure provides to users, and we can develop our apps and web applications with them. Right now I would like to move forward with the Speech service. First of all we need to create a Speech service resource. I have already created one, but I would like to show you how to create it. Whatever the service, first of all we need a subscription. I have the free trial subscription; if it is exhausted, you need to go for pay-as-you-go. Then there is the resource group. I already have two resource groups, so I will go with one of them; if you don't have a resource group, you need to create one. Then you need to select the region according to your requirements. As we already know, Azure has 60 regions available in 140 countries all over the world, and from those regions you select one as per your requirements.
Sorry, I have selected Canada Central; I need to select Central India, because that belongs to India. Then I give a name for the Speech service. I have already used the name speech 14, so let me try another one. The name has to be unique and must not match any other existing name. The pricing tier is Free or Standard, so select one of them. Then move on to Network and then Identity, leave everything at the defaults, and continue to Tags and then Review + create. The review step validates all the settings. Once validation has passed, you create the Speech service. I won't go further, because I have already created the Speech service. I will go to the services view, where I already have one Speech service, speech 14, which I created in the East US location. Let me open it. In the resource overview there is an option called Go to Speech Studio. There are also keys here, and you need to copy one key, because it may be helpful in the future; that's why I copied this key. The region is East US and the endpoint is this one. I would like to move forward with Go to Speech Studio. This is the interface of Speech Studio. In the previous session we discussed Vision Studio and the vision features, such as computer vision, and now we are discussing Speech Studio.
In Speech Studio there are different types of services, and we need to go through them one by one, step by step. The recent custom projects we have worked on are listed here. Projects are attached to the service used to create them, and you can view the full list of projects for a service by navigating to the home page for that service. That is the default view. You can explore, try out, and view sample code for common use cases and Azure Speech service features like speech-to-text and text-to-speech. So Speech Studio has two main forms, speech-to-text and text-to-speech, with the speech capabilities grouped by scenario. Speech Studio provides different demos, so let's try each one, starting with captioning with speech to text. Captioning with speech to text means this: suppose we have a video. Right now I'm speaking with you, and whatever I'm speaking can be subtitled in the proper language. Whether the speech is in English or French, it can be transcribed into one common language like English. There are two samples, real-time captioning and offline captioning. Let's go through real-time captioning first. This is one video; I don't know whether it is audible to you or not. Let's see: the subtitles appear below the video. The speech comes from an aeroplane, because this is one of the real-time scenarios provided by Microsoft. Now the captioning settings. So let's see.
The recognition event is real-time mode, the stable partial result threshold is 3, the maximum caption length is 60, and the maximum number of caption lines is 2, meaning a first line and a second line. Profanity is masked, and the phrase list includes neural TTS and Cognitive Services. Or you can move on to offline captioning. In offline captioning we have one video about Cosmos DB. The captioning settings here are: the recognition event is offline mode, the caption length is 60, the maximum number of caption lines is 1, profanity is masked, and the phrase list is Cosmos DB and NoSQL, so those terms can be detected in this scenario. That covers captioning, so let's move on to post-call transcription. Here we have sample data in audio form, and the audio is converted with speech to text. The scenario is applying for a loan: a customer makes an initial call to inquire about applying for a loan. The call is in speech form and is converted with speech to text. Let's see. Yes, the audio is transcribed into text like this. Now I would like to upload my own data to see how it is transcribed from speech to text. Let's see: select speech 14 and a resource group, acknowledge it, and browse files. One second. I recently uploaded one file. I think it will take some time.
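The captioning settings mentioned above (a maximum caption length of 60, a maximum of 2 caption lines, and masked profanity) can be illustrated with a small local sketch. This is not the Speech service's own captioning logic; it is a minimal approximation using only the Python standard library, and the function names and the blocked-word list are assumptions made for illustration.

```python
import re
import textwrap

MAX_CAPTION_LENGTH = 60  # characters per caption line, as in the demo settings
MAX_CAPTION_LINES = 2    # lines shown together in the real-time demo


def split_into_captions(text, max_length=MAX_CAPTION_LENGTH, max_lines=MAX_CAPTION_LINES):
    """Wrap recognized text into caption blocks of at most max_lines lines each."""
    lines = textwrap.wrap(text, width=max_length)
    return [lines[i:i + max_lines] for i in range(0, len(lines), max_lines)]


def mask_profanity(text, blocked_words):
    """Replace each blocked word with asterisks (hypothetical word list)."""
    if not blocked_words:
        return text
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(word) for word in blocked_words) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(lambda match: "*" * len(match.group(0)), text)
```

Calling `split_into_captions(recognized_text)` returns a list of caption blocks, each holding one or two lines of at most 60 characters. Setting `max_lines=1` mirrors the offline demo settings.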
So I would like to download this file from an online site. One second, I will just copy this. I need to download it in MP3 format, because the Speech service is reporting an unknown format. Let's see.
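The key and region copied from the portal earlier are what a client needs to call speech to text outside Speech Studio. The sketch below only builds the HTTP request and does not send it. It assumes the Speech service REST API for short audio, a 16 kHz PCM WAV file rather than MP3, and the region identifier `eastus`. The function name and the placeholder key are illustrative, and the URL pattern and headers should be checked against the current documentation before use.

```python
import urllib.request


def build_stt_request(region, key, audio_bytes, language="en-US"):
    """Build (but do not send) a short-audio speech-to-text request."""
    url = (
        f"https://{region}.stt.speech.microsoft.com/speech/recognition/"
        f"conversation/cognitiveservices/v1?language={language}&format=detailed"
    )
    headers = {
        # The key copied from the Speech resource page in the portal.
        "Ocp-Apim-Subscription-Key": key,
        # Assumption: the audio is 16 kHz PCM WAV, not MP3.
        "Content-Type": "audio/wav; codecs=audio/pcm; samplerate=16000",
        "Accept": "application/json",
    }
    return urllib.request.Request(url, data=audio_bytes, headers=headers, method="POST")


# Usage sketch: the key and the audio bytes here are placeholders.
request = build_stt_request("eastus", "YOUR_SPEECH_KEY", b"")
```

Passing the request to `urllib.request.urlopen` would send it. For long recordings such as the call-center example, the batch transcription API is the better fit.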

Speaker 2: So I need to download it again. I would like to move forward, because

Speaker 1: Yes, one second. It will be downloaded, and after that I will try once again. For now I will move forward with real-time speech to text and how it works. I would like to load this data. Okay, let's see what happens. I uploaded this file, and we can observe whether it generates the text or not. Yes, it generates the text like this. This is real-time speech to text: I uploaded the sample file, and the text is generated from the speech. It is the real-time demo of speech to text, and the text is generated as the clip plays back. Once it is generated, you can play the clip and check the text against what was actually said. Everything is captured by the speech analysis. Then let's move to another one. I am giving speech from scratch, and in this demo the text came out 100% correct. Then we move to custom speech. Custom speech improves the speech recognition accuracy of Microsoft speech to text for your target scenarios. You use your own domain-specific vocabulary data, pronunciation data, or audio samples recorded in your target acoustic environment.
Okay, so you may have data in PDF or text format, or audio data. Whether it is text data or audio data, it is used to train with custom speech, and the result is a custom speech-to-text model. Custom speech is part of Azure Cognitive Services; sorry, I said Cosmos DB by mistake. The custom speech model is built step by step: define targets, prepare your data, train and evaluate the model, and integrate it. Then we move to pronunciation assessment. Let's try one pronunciation and see how it works and whether it is 100% accurate or not. "Today was a beautiful day. We had a great time taking a long walk outside in the morning. The countryside was in full bloom, yet the air was crisp and cold. Towards the end of the day, clouds came in, forecasting much-needed rain." So this is the result. I got a pronunciation score of 89 percent; the accuracy score is 92 percent and the fluency score is 85 percent. So I am not 100% fluent when speaking in this scenario. This is how pronunciation assessment takes place. Competitive exams and similar platforms use pronunciation assessments like this.
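Outside Speech Studio, pronunciation assessment is requested by sending the reference text and a few options along with the audio. The sketch below assumes the REST API for short audio, where these options travel as base64-encoded JSON in a `Pronunciation-Assessment` header. The parameter names follow that API as best understood here and should be verified against the documentation; the function name is illustrative.

```python
import base64
import json

REFERENCE_TEXT = (
    "Today was a beautiful day. We had a great time taking a long walk "
    "outside in the morning."
)


def build_pronunciation_header(reference_text):
    """Encode pronunciation assessment options as a base64 JSON header value."""
    params = {
        "ReferenceText": reference_text,  # the text the speaker is expected to read
        "GradingSystem": "HundredMark",   # scores on a 0-100 scale, as in the demo
        "Granularity": "Phoneme",
        "Dimension": "Comprehensive",     # asks for accuracy, fluency, and completeness
    }
    return base64.b64encode(json.dumps(params).encode("utf-8")).decode("ascii")


# Assumed header name; it would be added to the speech-to-text request headers.
headers = {"Pronunciation-Assessment": build_pronunciation_header(REFERENCE_TEXT)}
```

The response would then carry scores comparable to the accuracy and fluency figures shown in the demo.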
All of these are created using Speech Studio. Then we can move forward to text-to-speech. In text-to-speech there are three different features, with different sample data. We have raw data in text format, and it is converted from text format to speech format. Opening the quick start is not mandatory, so I will move forward to custom voice. Welcome to custom voice. It takes voice recordings and transcripts, trains a neural voice on them, and generates a synthetic voice for your brand as output. So this is custom neural voice. The step-by-step process for professional custom neural voice is: apply for access, design the voice, prepare a script, record the voice, train the voice, and integrate. I will try to create one project; it will take one second. Okay, I think it is not available right now, because it is asking for limited access. So we go back and try audio content creation. In audio content creation we need to upload raw data in text form. I have already uploaded one file of environmental data. There are two sections here, so I go to the audio library and then the task library. I need to import the SSML files and download the report. One second. Once it is downloaded, please open it. In total, one file was imported from audio content creation. Okay. One second, guys, I would like to
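The SSML files mentioned for audio content creation are XML documents that wrap the text to be spoken in a voice element. A minimal sketch of building one in Python follows. The voice name `en-US-JennyNeural` and the function name are assumptions for illustration, and real SSML files can carry many more elements (prosody, breaks, styles) than shown here.

```python
from xml.sax.saxutils import escape


def build_ssml(text, voice="en-US-JennyNeural", lang="en-US"):
    """Return a minimal SSML document that speaks `text` with the given voice."""
    return (
        f'<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        f'xml:lang="{lang}">'
        f'<voice name="{voice}">{escape(text)}</voice>'
        "</speak>"
    )


# Usage sketch: the sentence is a placeholder for the uploaded text file's content.
ssml = build_ssml("Protecting the environment starts with small daily choices.")
```

Escaping the text matters because characters such as `&` and `<` would otherwise make the SSML invalid.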

Speaker 2: Inform Stories

Speaker 1: We need to upload the files, whether it is a folder or a text file, whatever it may be. Here you have to upload a text file. Okay. I have already uploaded the environment one. So I go through this: save, all files. Let me see whether it was created successfully, and I need

Speaker 3: the second nice

Speaker 2: Standard. So I could not access it here. One second. Just give me a second, guys. Speech is there. I think it will not work right now; it will take some time, so I'm unable to

Speaker 1: showcase it in front of you, so I will explain it one more time. Now we move back to Speech Studio. Let me tell you what happens. We upload raw data in text format, and that text is converted with text to speech. Previously we saw speech data converted into text format; this is the same thing in reverse: we have text data that is converted to speech. So the two are mirror images, text to speech and speech to text. Then we move to custom keyword. Here we have to provide a keyword. I have already created this project and given the description "I don't know". This "I don't know" is in text format, and we need it in speech format, so a voice command like "I don't know" will be generated. Let's see. This is speech 1; one second, it is processing. There are test models and tune models. First we process this data and then we create a new model. I will create one new model. I give the model a name and the description "I love cognitive services", and the input keyword is "love you". Then Next. One second. I would like to play this. Whatever input I have given for pronunciation will be played here. The pronunciation given is "love you", because I gave "I love you" as the keyword input. Then we have to select the level, basic or advanced. I select basic and then Create. Once that is completed, the model trains; when training finishes, the status becomes active, and then we move to test model and tune model. Then I go back to Speech Studio, to the voice assistant section. We have completed custom keyword, and now we move to custom commands. We need to create one project. I have given the name speech 1234 and the description "I love cognitive services", selected the language and the model, and then Create, with Speech111. You have to create a LUIS authoring resource, so you need to create one. Once that is completed, we go to custom commands. The project is ready, so open it, because we have created speech 1234 and we have to train the model. You can also add your own commands as per your requirements; if you have input data, you add it here. The example sentences are "help me", "what can I do", "how can I start", "hello", and "hi". These are all input data given by Azure, so they can be trained as the default. Then you can add your own command. The name cannot contain spaces, for example "demo". I can create one called "hello".

Speaker 2: how are you

Speaker 1: Oh, "how about you". So I have entered this one, and I have to save it and then train the model. Let's see; it will take some time to train. Once it is trained, you can add conditions, complete the rules, and then test and publish the application. For testing, I can give the command "hi".

Speaker 2: how are you

Speaker 1: Okay, so here we add the response message. Suppose you have created web applications with something like a chatbot. When the user gives a command like "hi", you need to define the message that is sent back. For example, when a person gives the command "hi", the developer decides which message is returned, and here the response is "Hi, how are you?", so I have entered that. Then for "hello", when we test the application, the response is "Hi, how can I assist you today?". So the commands come from the user side, and for each type of command the response that we defined on the developer side is generated automatically once the model is deployed. Then we publish. Once we have tested everything, we publish the model. Here publishing the application failed, because I have not created everything fully, so it is not complete. So that is custom commands in Speech Studio. Custom commands are for when we have developed web applications that include an assistant. For example, take ChatGPT. We ask ChatGPT with different types of commands, like what is the distance from Madurai to Chennai, or from Kanyakumari to the Himalayas. The OpenAI platform then creates a response such as "the distance from the Himalayas to Kanyakumari is approximately 3,000 kilometers". That is why we use this type of speech command or text command from the Azure model; OpenAI services are also available from the Azure portal. So all of these belong to the Speech cognitive service and Speech Studio. Today we learned about captioning with speech to text, where a video is captioned and the spoken language can be detected automatically. In post-call transcription we went through the applying-for-a-loan sample, where the audio is converted to text. We have real-time speech to text and custom speech. In pronunciation assessment, if we pronounce clearly and loudly, it can give up to 100% accuracy. There is the voice gallery, with voices provided by Azure, and there is custom voice, which uses your own audio recordings to create a distinct, one-of-a-kind voice for text-to-speech. Audio content creation is for fine-tuning the pronunciation of your spoken content, converting text to speech in the language you choose. Then there is the voice assistant section, with custom keyword and custom commands. Custom commands, as we discussed, work like giving commands to ChatGPT and getting a response back. That is all about Speech Studio for today's session. There are other cognitive services; we have completed only Vision and Speech so far, so tomorrow we will discuss Knowledge and the other cognitive services. Okay, thank you, guys. Thanks for attending this session.
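As a recap of the custom commands idea from this session, the mapping from a user command to the developer-defined response can be sketched in a few lines. This is a plain dictionary lookup, not the Custom Commands runtime, which also handles language understanding, parameters, and rules. The two responses come from the demo, while the function name and the fallback message are assumptions.

```python
# Responses defined on the developer side, taken from the demo.
RESPONSES = {
    "hi": "Hi, how are you?",
    "hello": "Hi, how can I assist you today?",
}


def respond(command, responses=RESPONSES, fallback="Sorry, I did not understand that."):
    """Return the configured response for a user command, or a fallback message."""
    normalized = command.strip().lower().rstrip("!?.")
    return responses.get(normalized, fallback)
```

For example, `respond("Hello!")` returns the assistant greeting, and any command that was not configured returns the fallback message.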
