Speaker 1: With the rise of use cases such as chatbots, virtual assistants, and increasing accessibility, applications using Automatic Speech Recognition, or ASR, are becoming more and more popular. In this video, we're going to be talking about automated speech recognition on Google Cloud with our very own Speech-to-Text API. Speech recognition models are great for generic, broad use cases, but as you get into specifics, certain nuances with domain-specific knowledge and proper nouns are not always captured by out-of-the-box models.

Hi everyone, I'm Anu Srivastava, and in this video I'll walk through how you can boost ASR accuracy without having to train your own custom model. We'll be using the Speech-to-Text API, which allows you to accurately convert audio to text. Some of the highlights of this service are multi-speaker detection if you have a recording of several people conversing, streaming speech recognition, profanity filtering, and support for over 125 languages. Fun fact: it's available on-prem as well.

To start, let's run a sample with the API to see how it works. Here in my notebook, I have a code sample set up that will transcribe the audio, then print out the results. First, I import the client library and create a client to call the API from. I grab the sample audio file, which is stored in Google Cloud Storage, from the bucket URI. Then I set a few configuration variables and call the API. Let me play you the sample audio file: "Hi everyone, my name is Anu, I live in New York, and I work at Google Cloud." Now I'll run it in my notebook, and we have the transcription.

Now that we have the transcription, how do I know the quality of the output? What if I need to automate bulk transcripts for a business-critical need? This is where we need to define speech accuracy, which is especially important once we start working with audio transcription at scale. One standard way to measure it is called Word Error Rate, often abbreviated as WER. Word Error Rate measures the percentage of incorrectly transcribed words in the entire set, so the lower the Word Error Rate, the more accurate the system. It combines the three types of transcription errors that can occur: WER = (S + D + I) / N, where N is the number of words in the ground truth. First, we have substitution errors (S): words that are present in both the hypothesis and the ground truth but are not transcribed correctly. Then deletion errors (D): words that are missing from the hypothesis but present in the ground truth. Finally, insertion errors (I): words present in the hypothesis transcript that are not present in the ground truth, a.k.a. the correct answer as defined by us.

When we're doing this at scale, there could be several imperfections in our transcription, which is why we have some handy tips and tools to help you improve accuracy. First, customize the model to your domain by providing contextual information. Let's say you are creating a bot that allows people to order pizza; you might want to increase the probability that words like pepperoni, olives, and mozzarella are recognized. Second, tweak the weights to address specific word or phrase issues. Say you're trying to recognize proper nouns, rare words, or even made-up words. It's unlikely that these will be transcribed correctly initially, so biasing towards them can fix individual terms. Third, use context to bias towards specific types of information or words. Let's say we have an IVR telephone system and you have just asked someone for their order number; you can bias specifically towards an alphanumeric entry.

Let's check this out in my notebook. Here I have some code that pulls a longer recording of me talking about my favorite restaurants nearby. I've recorded this and uploaded it to a Google Cloud Storage bucket. Let me play the audio sample for you: "Hey everyone. My name is Anu and I live in New York City. My favorite restaurant is Estella. It's on Houston Street in the Lower East Side neighborhood, which we like to call LES. I also love cuisines from around the world, such as at places like Sichuan Mountain House and Balad." Here we will run the sample through the API. Then we implement a helper to calculate and display the WER for us (a minimal sketch of such a helper appears at the end of this transcript). In this sample, we actually have the ground truth, the correct transcript for our accuracy calculation, written out up here. Let's see how we did. Pretty good, but I think we can boost the accuracy and lower the word error rate.

How can we lower the word error rate? We can do this by using the Speech Adaptation API. The Speech Adaptation API allows users to pass phrases and associated weights directly to the Speech API. These phrases can be changed with every request, which allows for both quick iteration and on-the-fly adaptation. All you have to do is include the terms in the request itself as part of the recognition config. Here's an example of what a simple adaptation would look like.
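A minimal sketch of what such a request might look like with the google-cloud-speech Python client is below. The bucket URI, file name, and boost value are illustrative assumptions, not taken from the video; the phrase spellings follow this transcript.

```python
# Sketch: phrase biasing (speech adaptation) with the Python client.
# The bucket URI, file name, and boost value are illustrative assumptions.
from google.cloud import speech

client = speech.SpeechClient()

# Longer audio stored in Cloud Storage, as in the demo.
audio = speech.RecognitionAudio(uri="gs://my-bucket/restaurants.wav")

config = speech.RecognitionConfig(
    language_code="en-US",
    # Bias recognition toward the proper nouns we expect in the recording.
    # For the IVR case above, a class token such as
    # "$OOV_CLASS_ALPHANUMERIC_SEQUENCE" could be added to the phrases.
    speech_contexts=[
        speech.SpeechContext(
            phrases=[
                "Estella",
                "Houston Street",
                "LES",
                "Sichuan Mountain House",
                "Balad",
            ],
            boost=15.0,  # strong context: we're sure these words occur
        )
    ],
)

# long_running_recognize handles audio longer than about a minute.
operation = client.long_running_recognize(config=config, audio=audio)
response = operation.result(timeout=300)

for result in response.results:
    print(result.alternatives[0].transcript)
```

Because the phrases ride along with each request, they can change call by call, which is what enables the quick iteration described above.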
How should you form your adaptations? What are the right terms to provide? There are a few things to consider. First, what am I doing with this transcript? Is there a downstream system that will be sensitive to particular words or phrases? Those words should get a bias towards them. Are there rare words or proper nouns? These should also get a bias boost. What contextual information can I use? Do you know what words someone might say, or what they said in the past? These can be biased towards to help increase accuracy, even on commonly occurring words, if you are sure they will be present. Do you have strong or weak context? You can bias heavily with strong context if you are sure the user is about to mention some specific words. You should bias less if you have weak context, meaning you know that certain words will occur but not exactly when or where.

Let's go back to our demo sample. We're talking about restaurants, and we have some proper nouns for the locations. Let's try computing the WER score again with these boosted phrases. Voila! We were able to get the word error rate down from our previous score. With a longer transcript, you can keep iterating on this to fine-tune your results.

There you go. That's one way you can boost your accuracy on pre-trained models without having to build a model from scratch. We hope this was useful, and we can't wait to see what you do with the Speech-to-Text API. Thanks for joining, and check out the lab linked below to try boosting speech accuracy yourself.
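For reference, the WER helper used in the demo isn't shown on screen; here is a minimal sketch of one way to write it, assuming the standard definition above: compute a word-level edit distance between ground truth and hypothesis, then divide by the number of ground-truth words.

```python
# Sketch of a WER helper: WER = (substitutions + deletions + insertions) / N,
# where N is the number of words in the ground truth.
def wer(ground_truth: str, hypothesis: str) -> float:
    ref = ground_truth.lower().split()
    hyp = hypothesis.lower().split()
    # Word-level edit distance via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting every remaining reference word
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting every remaining hypothesis word
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,        # deletion
                d[i][j - 1] + 1,        # insertion
                d[i - 1][j - 1] + sub,  # substitution or match
            )
    return d[-1][-1] / len(ref)

# Example: one substitution in a four-word reference -> 25% WER.
print(f"{wer('my name is Anu', 'my name is Anna'):.0%}")
```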