Developing Speech-to-Text with the C# SDK
Explore the C# SDK for speech-to-text applications. Learn integration with the REST API, device setup, and custom audio input in multi-platform environments.
Microsoft Azure Cognitive Services Speech to Text SDK
Added on 01/29/2025

Speaker 1: Let's learn how to develop for the speech-to-text service. In this module, we will first explore the SDK by looking at the different classes and how they interact to create a speech-to-text application. We will focus on C# examples. There are multilingual and multi-platform SDKs available, but we are going to focus on C# due to time constraints. After we look at the SDK and go over some C# examples, we will create a quick speech-to-text application using the C# SDK. After the quick C# demo, we will go over the REST API and cover how to integrate with the speech-to-text service through a simple HTTP REST API. After that, we will have a demo using that REST API in Postman. Let's get started with the speech-to-text SDK. The only capability provided by both the REST API and the SDK client libraries is the ability to transcribe an utterance that is less than 15 seconds long, with no interim results. LUIS intents and entities can be derived using a separate LUIS subscription. With that subscription, the SDK can call LUIS for you and provide entity and intent results along with the speech transcriptions. With the REST API, you can call LUIS yourself to derive intents and entities from the transcribed text. The SDK capabilities give the developer the full functionality of the speech-to-text service. As previously stated, you can transcribe a short speech segment (less than 15 seconds) with either the SDK or the REST API. With the SDK, you can also transcribe longer utterances (longer than 15 seconds) and transcribe streaming audio. The SDK allows for integration with the Language Understanding Intelligent Service, also called LUIS, so that intents and entities can be derived from the audio being sent. For more information on LUIS, check out Getting Started with Building Bots with Microsoft Bot Framework by Matthew Kruszczak on Pluralsight. Throughout this course, we will use the C# SDK as the reference for the SDK functionality; the other supported languages have a similar interface to the C# SDK. The C# SDK also runs in multiple environments. It supports the .NET Framework on Windows and has multi-platform support through .NET Core. In addition, the C# SDK supports the Universal Windows Platform and the Unity engine. Now let's take a look at the C# interface for speech-to-text. Before doing anything else, there needs to be a speech configuration created. For ease of use, a speech configuration instance can be created with the FromSubscription helper method. The only parameters needed are your subscription key and your service region, both of which can be found in the Azure portal. Once there is a configuration created, we can create a recognizer. For speech-to-text, a speech recognizer is all we need. Using the default constructor with the speech recognizer, the default operating system microphone will be used when the recognizer is invoked. You can use audio from more than just the default OS microphone. You can use audio from a file as your input. It's as simple as creating an audio config from a supported file type. Then, when the recognizer is constructed, the audio config is passed in as the second parameter. Now, when the recognizer is invoked, it will use the audio file as the input instead of the default microphone. Not only can you use the default microphone, you can also choose your microphone input. Using the FromMicrophoneInput helper function, you only need the device ID.
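A minimal C# sketch of these pieces might look like the following; the subscription key, region, file path, and device ID are placeholders for illustration, not values from the course:

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.CognitiveServices.Speech;
using Microsoft.CognitiveServices.Speech.Audio;

class SpeechToTextSetup
{
    static async Task Main()
    {
        // Create the speech configuration from a subscription key and service region,
        // both of which can be found in the Azure portal (placeholders here).
        var speechConfig = SpeechConfig.FromSubscription("<your-subscription-key>", "<your-region>");

        // Default constructor: the recognizer uses the default OS microphone.
        using var micRecognizer = new SpeechRecognizer(speechConfig);

        // Audio from a file: pass an AudioConfig as the second parameter.
        var fileAudio = AudioConfig.FromWavFileInput("path/to/utterance.wav");
        using var fileRecognizer = new SpeechRecognizer(speechConfig, fileAudio);

        // Audio from a specific microphone, selected by its device ID.
        var deviceAudio = AudioConfig.FromMicrophoneInput("<device-id>");
        using var deviceRecognizer = new SpeechRecognizer(speechConfig, deviceAudio);

        // Invoking the recognizer (result handling is covered later in the module).
        var result = await micRecognizer.RecognizeOnceAsync();
        Console.WriteLine(result.Text);
    }
}
```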
I should note that, at the time of this recording, this functionality is not yet available from JavaScript. Getting the available device IDs differs from platform to platform. For UWP, it is very simple to get the collection of audio capture devices. Use the DeviceInformation class from the Windows.Devices.Enumeration namespace. It has a method named FindAllAsync that takes the DeviceClass enumeration as a parameter to filter the devices for you. Once you have the collection of devices, you need to select the device you want to use, and that object has an Id property, which is the string that the FromMicrophoneInput method needs. Here is an example audio device ID for UWP. For retrieving the audio devices in Windows, you can use the NAudio library to simplify the retrieval of devices. After you import the library, create an MMDeviceEnumerator, and then use that object by calling EnumerateAudioEndPoints. The collection returned contains objects with the ID property we need. Here is an example of a Windows audio input ID. For Linux, the device IDs are selected using standard ALSA device IDs. The IDs of the inputs attached to the system are contained in the output of the command arecord -L. Alternatively, they can be obtained using the ALSA C library. Here are some sample device IDs for Linux. Audio device selection with the Speech SDK is not supported on iOS. However, apps using the SDK can influence audio routing through the AVAudioSession framework. For example, this instruction enables the use of a Bluetooth headset for a speech-enabled app. The Speech SDK's audio input stream API provides a way to stream audio into the recognizer instead of using either the microphone or the file input APIs. There are two types of input streams: push streams and pull streams. Push streams allow you to write data to the stream, whereas pull streams implement methods that let the SDK pull data from the stream. Once created, either a pull or a push stream is used as the audio input for the recognizer. A custom audio input stream can be created to encapsulate your audio input and interface with the Speech-to-Text SDK. The first step is to identify the format of the audio stream. The format must be supported by the Speech SDK and the Speech service. Currently, only the following configuration is supported: audio samples in PCM format, one channel, 16,000 samples per second, 32,000 bytes per second, a two-byte block alignment (16 bits including padding for a sample), and 16 bits per sample. The code sample on the left is what it would look like in the SDK to create the audio format. The next step is to make sure your code can provide the raw audio data according to these specifications. If your audio source data doesn't match the supported format, the audio must be transcoded into the required format. Once you have verified that your audio meets the format specification, you can create your own audio input stream class derived from PullAudioInputStreamCallback. Implement the Read and Close members to handle the lifecycle events for the custom input stream callback. The exact function signatures are language dependent, but the code will look similar to this. The final step, once you have created the custom class, is to create an audio configuration based on your audio format and input stream. Pass in both your regular speech configuration and the audio input configuration when you create your recognizer.
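As a rough illustration, enumerating capture devices on Windows with NAudio might look like this; the console output format is just an assumption for the sketch:

```csharp
using System;
using NAudio.CoreAudioApi;

class ListCaptureDevices
{
    static void Main()
    {
        // Enumerate the active audio capture endpoints; each device's ID string
        // is the value that AudioConfig.FromMicrophoneInput expects.
        var enumerator = new MMDeviceEnumerator();
        foreach (var device in enumerator.EnumerateAudioEndPoints(DataFlow.Capture, DeviceState.Active))
        {
            Console.WriteLine($"{device.FriendlyName}: {device.ID}");
        }
    }
}
```

And here is a minimal sketch of the custom pull-stream pieces; the ContosoAudioStream class name is borrowed from the example mentioned in the course, and its placeholder buffer stands in for whatever real audio source you would encapsulate:

```csharp
using System;
using Microsoft.CognitiveServices.Speech;
using Microsoft.CognitiveServices.Speech.Audio;

// Hypothetical callback that pulls raw PCM audio from your own source.
public class ContosoAudioStream : PullAudioInputStreamCallback
{
    private readonly byte[] _source = new byte[0]; // placeholder for a real audio feed
    private int _position;

    // Called by the SDK whenever it needs more audio. Copy up to 'size' bytes
    // into dataBuffer and return the count; returning 0 signals end of stream.
    public override int Read(byte[] dataBuffer, uint size)
    {
        int bytesToCopy = Math.Min((int)size, _source.Length - _position);
        if (bytesToCopy <= 0) return 0;
        Array.Copy(_source, _position, dataBuffer, 0, bytesToCopy);
        _position += bytesToCopy;
        return bytesToCopy;
    }

    // Called when the recognizer is finished with the stream.
    public override void Close()
    {
        // Release the underlying audio source here.
    }
}

public static class CustomStreamSetup
{
    public static SpeechRecognizer CreateRecognizer(SpeechConfig speechConfig)
    {
        // The only supported stream format: PCM, 16,000 samples per second,
        // 16 bits per sample, single channel.
        var audioFormat = AudioStreamFormat.GetWaveFormatPCM(16000, 16, 1);

        // The custom stream goes in as the first parameter of FromStreamInput,
        // and the resulting audio config is passed alongside the speech config.
        var audioConfig = AudioConfig.FromStreamInput(new ContosoAudioStream(), audioFormat);
        return new SpeechRecognizer(speechConfig, audioConfig);
    }
}
```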
In this case, the Contoso audio stream is used as the first parameter in the FromStreamInput method to create the audio config that is eventually used when we create the recognizer. Previously, we looked at code that created a speech configuration and immediately passed it to the recognizer. Before consuming the SpeechConfig object, the object can be edited. In this example, the speech recognition language and the output format are set before the configuration is used to create the recognizer. The fields that can be edited on the SpeechConfig are the authorization token, the endpoint ID, the output format, the region, the speech recognition language, and the subscription key. We have covered the steps to create a recognizer. Now it's time to use that recognizer to convert speech to text. To transcribe a short utterance, all that is needed is to invoke the RecognizeOnceAsync method on the recognizer object. Once the transcription is complete, the result's reason needs to be checked to handle application flow. The reasons relevant to the application flow for speech transcription are RecognizedSpeech, NoMatch, and Canceled.
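A sketch of editing the SpeechConfig and checking the result reason might look like the following; the language, output format choice, and placeholder key and region are illustrative rather than taken from the course:

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.CognitiveServices.Speech;

class RecognizeOnceExample
{
    static async Task Main()
    {
        // Create the configuration, then edit it before it is consumed by the recognizer.
        var speechConfig = SpeechConfig.FromSubscription("<your-subscription-key>", "<your-region>");
        speechConfig.SpeechRecognitionLanguage = "en-US";
        speechConfig.OutputFormat = OutputFormat.Detailed;

        using var recognizer = new SpeechRecognizer(speechConfig);

        // Transcribe a single short utterance (less than 15 seconds).
        var result = await recognizer.RecognizeOnceAsync();

        // Check the reason to handle application flow.
        switch (result.Reason)
        {
            case ResultReason.RecognizedSpeech:
                Console.WriteLine($"Recognized: {result.Text}");
                break;
            case ResultReason.NoMatch:
                Console.WriteLine("No speech could be recognized.");
                break;
            case ResultReason.Canceled:
                var cancellation = CancellationDetails.FromResult(result);
                Console.WriteLine($"Canceled: {cancellation.Reason} {cancellation.ErrorDetails}");
                break;
        }
    }
}
```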
