Introducing Torch Audio 0.3: Powerful Audio Processing

Discover Torch Audio's new features like inverse short-time Fourier transform and resampling, leveraging PyTorch for efficient audio processing and machine learning.

Torchaudio 0.3 with Kaldi Compatibility, New Transforms: A Quick Introduction by Jason Lian

Added on 01/29/2025

Speaker 1: Hi, I'm Jason. I'm an intern working on Torch Audio, and for the last couple of months we've been working on building the release for version 0.3. Another engineer who was working on Torch Audio is Vincent.

So for Torch Audio, what we want to do is use PyTorch, but inside the audio domain. The purpose of doing this is to leverage PyTorch's features and performance for machine learning. What this means is we get parallelized processing via GPU. You can also save your models and optimize them, and once you save the models, you can load them in processes where you don't have to use Python. What we added in version 0.3 is a bunch of new features, such as inverse short-time Fourier transform and resampling.

Here is a brief overview of what you can do with Torch Audio. We have I/O and transforms, as well as Kaldi support. For I/O, you give us a file name, and we can load a tensor from it. In addition, you can take a tensor and save it to a file. We support a wide variety of file formats, such as MP3, FLAC, and WAV files. Also with I/O, we can load data sets very easily. You just write a couple of lines of code.

For our transforms, we have neural network models that provide signal processing functionality. What this means is we have spectrogram, MFCC, and resampling. These are all implemented in PyTorch, so you get parallelized processing as well as JIT support.

The last feature of Torch Audio is Kaldi support. If you're not familiar with Kaldi, it's an audio signal processing library that's written in C++. What we do is write the same functions, but in PyTorch. We can also read and write ark files, which are Kaldi files that are similar to CSV. It's how you store data.

So here's a small code snippet of what you can do with Torch Audio. As you can see, we're given a file name, and we load a tensor. And we also get the sample rate.
So you can see this on the left diagram here. We also take this waveform and run it through a spectrogram, and we give it an input parameter for the number of Fourier bins. You can do this with all our transforms: you just give it a couple of parameters, and you can modify how the neural network model does it. Once you run it through this transform, you get this output, and this is a tensor of a spectrogram.

OK, so here's another code snippet of what you can do to replace your Kaldi binary. You take the waveform that we read from file before, and you compute a spectrogram here. What this means is it has the exact same parameters that the Kaldi binaries have. The only difference is you're running it online instead of using a binary.

OK, so a practical application of Torch Audio. What we did here is we took Shakespeare's Hamlet, "To be or not to be." We took the audio file, and we did voice activity detection, so we segmented the file based on when the person is speaking. We run it through the Kaldi fbank, and you get this tensor below here. Then, using this normalized fbank, we can transcribe the audio, as you can see here. There's also a live demo that you can see at our booth, where a person reads into the microphone, and then we transcribe it.

OK, so this is our URL, pytorch.org/audio. There's a bunch of tutorials and documentation. You can also find our GitHub page, and based on that, you can start contributing new features. As I mentioned, we recently released version 0.3, and we encourage you to try it out. Thank you.