Understanding LibriSpeech and GigaSpeech Model Training
Explains the training and decoding process for a model shared between LibriSpeech and GigaSpeech, highlighting the separate data loaders and the normalization differences between the two corpora.
Dan K2 32 Multiple Datasets in Training Next-gen Kaldi
Added on 01/29/2025

Speaker 1: Can you please explain this diagram to us?

Speaker 2: Okay, so this is about how the training works. Like I say, most of the model is shared, but there are parts of it that are specific to LibriSpeech and GigaSpeech. What the decoder does is take the language-model history of the last two tokens and encode it into a vector. That's the recurrent part of an RNN-T; at least, it's recurrent on the RNN-T's output. So the decoder and joiner are specific to LibriSpeech and GigaSpeech. We have two separate data loaders; we don't combine the two datasets into one data loader, because we need to feed them to the appropriate head, and it would be quite inconvenient to split the data within a mini-batch apart to one head or the other. So what we do is create two data loaders, and I think the way it works is that the epoch iterations are over LibriSpeech, and then we just continuously get data from GigaSpeech; when it runs out, we restart the data loader, or we put the data loader in some kind of mode where it does that.

Speaker 1: Okay. Also, can you explain to us how to decode using the GigaSpeech decoder?

Speaker 2: That's the kind of thing people had better create an issue on GitHub to ask about. My guys know about this, but I'm not sure of the specifics.

Speaker 1: Okay. Can you explain how the GigaSpeech model is trained?

Speaker 2: Our GigaSpeech model wasn't trained just on GigaSpeech; it was trained on LibriSpeech and GigaSpeech. Now, it's difficult to combine those two datasets because they're normalized quite differently: GigaSpeech has things like comma and period as separate words. So what we did is share most of the model but give it two heads, one for LibriSpeech and one for GigaSpeech. That's how we trained it, and this API uses the LibriSpeech head, so it will output in a way that's normalized like LibriSpeech; it won't give you period and comma and things like that. My suspicion is that there's probably not a ton of difference between what the two heads output, because most of the model is shared, but I don't know that for sure. The underlying models should support decoding with GigaSpeech, you just have to pick the right top part of it, but I don't think this API supports the GigaSpeech head. Thank you for watching.
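
To make the shared-encoder, two-head structure concrete, here is a minimal PyTorch sketch. This is not the icefall implementation: the class names, the ModuleDict keys, and the joiner (simplified here to a concatenation plus a linear layer; real transducer joiners typically combine projected encoder and decoder outputs through a non-linearity) are all illustrative.

```python
import torch
import torch.nn as nn


class StatelessDecoder(nn.Module):
    """Encodes the last two emitted tokens (the LM history) into a vector."""

    def __init__(self, vocab_size: int, embed_dim: int):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # A width-2 convolution over the two-token context stands in for the
        # recurrence of a classic RNN-T prediction network.
        self.conv = nn.Conv1d(embed_dim, embed_dim, kernel_size=2)

    def forward(self, prev_tokens: torch.Tensor) -> torch.Tensor:
        # prev_tokens: (batch, 2) -> (batch, embed_dim)
        x = self.embedding(prev_tokens).transpose(1, 2)  # (batch, embed_dim, 2)
        return self.conv(x).squeeze(-1)


class TwoHeadTransducer(nn.Module):
    """Shared encoder; the decoder and joiner are duplicated per dataset."""

    def __init__(self, encoder: nn.Module, vocab_size: int,
                 embed_dim: int, enc_dim: int):
        super().__init__()
        self.encoder = encoder  # shared between LibriSpeech and GigaSpeech
        self.decoders = nn.ModuleDict({
            "librispeech": StatelessDecoder(vocab_size, embed_dim),
            "gigaspeech": StatelessDecoder(vocab_size, embed_dim),
        })
        self.joiners = nn.ModuleDict({
            "librispeech": nn.Linear(enc_dim + embed_dim, vocab_size),
            "gigaspeech": nn.Linear(enc_dim + embed_dim, vocab_size),
        })

    def forward(self, feats: torch.Tensor, prev_tokens: torch.Tensor,
                head: str) -> torch.Tensor:
        enc = self.encoder(feats)                # (batch, T, enc_dim), shared
        dec = self.decoders[head](prev_tokens)   # (batch, embed_dim), per head
        dec = dec.unsqueeze(1).expand(-1, enc.size(1), -1)
        return self.joiners[head](torch.cat([enc, dec], dim=-1))
```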
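The two-loader arrangement described above can be sketched as follows. Everything named here (the loaders, compute_loss, the optimizer) is a placeholder; the point is only that the epoch is defined by a full pass over LibriSpeech, while the GigaSpeech iterator is restarted whenever it runs out.

```python
def train(model, libri_loader, giga_loader, optimizer, compute_loss,
          num_epochs: int) -> None:
    giga_iter = iter(giga_loader)
    for epoch in range(num_epochs):
        # One "epoch" is a full pass over the LibriSpeech loader.
        for libri_batch in libri_loader:
            try:
                giga_batch = next(giga_iter)
            except StopIteration:
                # GigaSpeech ran out mid-epoch: restart its loader and go on.
                giga_iter = iter(giga_loader)
                giga_batch = next(giga_iter)

            # Each batch stays whole and goes to its own head, so there is
            # no need to split a mini-batch between heads.
            loss = (compute_loss(model, libri_batch, head="librispeech")
                    + compute_loss(model, giga_batch, head="gigaspeech"))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```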
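Likewise, "picking the right top part" at decode time might look like the greedy search below, written against the sketch model above rather than the actual sherpa/icefall decoding API; blank_id and the per-frame symbol cap are assumptions. With head="gigaspeech" you would expect punctuation words in the output (the GigaSpeech corpus marks punctuation with word tokens such as <COMMA> and <PERIOD>), while head="librispeech" gives LibriSpeech-style output without them.

```python
import torch


@torch.no_grad()
def greedy_decode(model: TwoHeadTransducer, feats: torch.Tensor,
                  blank_id: int, head: str = "gigaspeech",
                  max_sym_per_frame: int = 3) -> list[int]:
    enc = model.encoder(feats)      # run the shared encoder once
    hyp = [blank_id, blank_id]      # two-token context, seeded with blanks
    for t in range(enc.size(1)):
        # Emit at most a few symbols per frame, then move to the next frame.
        for _ in range(max_sym_per_frame):
            prev = torch.tensor([hyp[-2:]])
            dec = model.decoders[head](prev)
            logits = model.joiners[head](torch.cat([enc[:, t], dec], dim=-1))
            tok = int(logits.argmax(dim=-1))
            if tok == blank_id:     # blank means "advance to the next frame"
                break
            hyp.append(tok)
    return hyp[2:]                  # drop the initial blank padding
```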
