Deep Neural Networks for Automatic Lyrics Transcription: A PhD Journey
Speaker 1: Good morning everyone, I hope everyone is having great weather just like we have in London today. So, I'm Emir Demirel, or QMUL1, that's my barcode within the MIP-Frontiers project. I'm doing my PhD at Queen Mary University of London, and my research is supervised by Professor Simon Dixon from the Centre for Digital Music and Professor Sven Ahlbäck from Doremir Music Research AB in Stockholm, Sweden. I will be talking about my PhD thesis, to which I gave the title Deep Neural Networks for Automatic Lyrics Transcription. First I will talk about my motivation and why I chose this title, then we will define the problem, then we will go through the state-of-the-art methods that have attempted to solve it, then in the Automatic Lyrics Transcription chapter we will go through our contributions to this research field, and finally we will finish with a summary of our contributions and the outcomes of this research. Hopefully, if we have time, we will also have some demonstrations.

So this was my original thesis proposal: Representations and Models for Singing Voice Transcription. In general, automatic music transcribers have mostly modeled the singing voice in terms of pitch versus time. Obviously there are other aspects of the singing voice that are useful for describing a performance, such as vocal qualities, styles, note transitions, intonation and, last but not least, lyrics. As lyrics are one of the essential building blocks for understanding, appreciating and representing singing voice performances, we thought this would be an essential feature for a potential automatic music transcription application. The example you see here is an outcome of our research, a commercial application developed by Doremir, but I will come back to that later on; it is just to show that we actually built something that works.

So let's define the task. This is my definition: the task of automatic lyrics transcription can be defined as the procedure of transforming a singing voice performance with lyrics, given as a finite-length audio signal, into a string of text that is targeted to match the original lyrics, or what humans would generate. In mathematical terms we can summarize this problem with an equation that predicts the likeliest word sequence, denoted by W, given the acoustic features or observations, denoted by X. This is the fundamental equation of speech recognition, and when we apply it to singing data it translates directly to automatic lyrics transcription. This lovely equation is the thing that we're trying to solve.
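For reference, the equation referred to above is the standard formulation from speech recognition; the exact notation on the slide may differ, but it can be written as:

```latex
% Likeliest word sequence W given acoustic observations X,
% with Bayes' rule exposing the acoustic model p(X|W) and the language model P(W).
\hat{W} = \underset{W}{\arg\max}\ P(W \mid X)
        = \underset{W}{\arg\max}\ p(X \mid W)\, P(W)
```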
From the perspective of speech recognition, there are two major streams of approaches in the state of the art: DNN-HMM based models and end-to-end models. DNN-HMM based models are composed of multiple independent computational blocks that build the overall word classifier, whereas end-to-end models provide a single module consisting purely of neural networks. Specifically, DNN-HMM models use the concept of phonemes, the basic units of speech, as Kilian also mentions in his presentation. Words are considered to consist of sequences of phonemes. So first we have an acoustic model that learns a mapping between acoustic observations and phonemes, and the phoneme probabilities are converted into word probabilities using a pronunciation model. An example is given on the right-hand side. To obtain more grammatically sensible outputs, we smooth the word probabilities with a language model. Once we have the word probabilities we obtain a graph like this one, which stores all the alternative transcriptions, so we apply decoding from left to right to obtain the final transcription; this can be done with traditional algorithms such as beam search, Viterbi, etc. The state-of-the-art training approach is called sequence discriminative training, which uses maximum mutual information as the objective function. It is quite a spicy formula, so I didn't include it on this slide to avoid any kind of distraction; if you'd like to learn more about it, feel free to check the reference. In principle it is like CTC optimization, if anyone is familiar with that function: the optimization is done at the sequence level rather than the frame level.

End-to-end models, on the other hand, consist purely of neural networks, as I said. They don't require a pronunciation model, as you see here, and because of that we don't need prior linguistic expertise, which makes them really attractive. Specifically, sequence-to-sequence structures are very popular in the context of speech recognition: the acoustic features are first encoded into a latent space, usually referred to as a context vector, which is then decoded into output transcriptions with a decoder network. You can use CTC or attention as the middle block; I'm not going to go into the details of that. Most recently, transformers were introduced, which were considered a breakthrough in deep learning research because they made CNNs and RNNs redundant in this scheme. They are purely based on self-attention and fully connected layers, and due to that they have fewer trainable parameters; they are also parallelizable, unlike RNNs, and because of that they come with faster training. The current state-of-the-art end-to-end speech recognition systems are also based on the transformer architecture. This is the literature review that will be useful for understanding what follows.
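As a toy illustration of the left-to-right decoding over the graph of alternative transcriptions described above, here is a minimal beam-search sketch in Python. The vocabulary and probabilities are made up, and real decoders work over lattices with pronunciation and language model scores folded in.

```python
import math

def beam_search(log_probs, vocab, beam_width=3):
    """Toy left-to-right beam search over per-step word log-probabilities.

    log_probs: list of dicts mapping each word to its log-probability at that step.
    Returns the highest-scoring word sequence and its cumulative score.
    """
    # Each beam entry is (sequence_of_words, cumulative_log_prob).
    beams = [([], 0.0)]
    for step in log_probs:
        candidates = []
        for seq, score in beams:
            for word in vocab:
                candidates.append((seq + [word], score + step.get(word, -math.inf)))
        # Keep only the best `beam_width` partial hypotheses.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams[0]

# Hypothetical per-step word probabilities for a two-word lyric line.
vocab = ["hello", "yellow", "world", "word"]
log_probs = [
    {"hello": math.log(0.6), "yellow": math.log(0.4)},
    {"world": math.log(0.7), "word": math.log(0.3)},
]
print(beam_search(log_probs, vocab))  # (['hello', 'world'], ...)
```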
So let's talk about something more interesting and transition from speech data to singing data. When we were conducting our research, we oriented our progression around the research questions displayed on the slide. What is the performance ceiling for lyrics transcription when directly adapting state-of-the-art speech recognition to singing data? Most of the time, words are pronounced differently in singing compared to speech, so what are these systematic or common pronunciation variations and how do they affect overall word recognition performance? Also, in our research we observed that lyrics transcribers trained on a cappella recordings perform quite poorly on polyphonic recordings, and the reverse also holds: polyphonic models perform quite poorly on a cappella recordings. So we also asked: is it possible to obtain a single acoustic model whose performance is undisturbed across varying domains? And finally, one of the hottest questions in deep learning research: which one is better, DNN-HMM or end-to-end models? We try to answer that question as well.

Our first publication was concerned with the first research question. We used a state-of-the-art DNN-HMM speech recognition framework trained on the DAMP dataset, which is the benchmark training set used in lyrics transcription and consists only of monophonic recordings. We developed a novel self-attention based acoustic model, we also exploited neural networks for building the language models, and we were able to obtain a five percent absolute word error rate improvement compared to the previous results. Word error rate is the standard metric for evaluating lyrics or speech recognition systems, and all in all, the lower the better; it's an error rate, basically.

In our next publication we carried out a comparative study of the phonetic variations in word pronunciations between their canonical forms and their realizations in singing. In broad terms we observed that, no surprise, words are uttered with longer vowels, and also that certain consonants are omitted when they are supposed to be pronounced towards word endings. Obviously this is language specific, and our experiments were conducted on English. We embedded these observations explicitly in a speech pronunciation dictionary to create a singer-adapted version of it, and we used that singer-adapted lexicon in lyrics transcription experiments. We observed consistent but marginal improvements. If you'd like to check the results, please copy and paste the title of this slide; I do not include a lot of numbers here, to avoid any further distraction.

Our most recent paper is called MSTRE-Net, which is multistreaming acoustic modeling for lyrics transcription. Here we try to answer both of the remaining questions: whether it is possible to obtain a single, cross-domain acoustic model, and how to apply the latest state of the art to lyrics transcription. I'd like to talk a little bit about the multi-stream architecture. The multi-stream time delay neural network architecture is the state-of-the-art neural network architecture for DNN-HMM based ASR. Time delay neural networks are essentially one-dimensional convolutional neural networks where the convolution is applied along the time axis with a dilation rate greater than one, so instead of convolving consecutive frames we convolve frames dilated in time. A single-stream architecture can be seen in the leftmost image, with the time delay blocks highlighted by the dotted rectangular shape. The multi-stream architecture is inspired by how the human auditory system processes acoustic information at multiple resolutions: multiple streams of time delay blocks operate in parallel, where each stream has a unique dilation rate. However, that architecture uses an identical structure for each stream in terms of hidden layer dimensions and number of hidden layers, so we compacted it such that each stream has a unique structure in terms of these parameters.
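To make the time-delay and multi-stream ideas above concrete, here is a minimal PyTorch sketch of a multi-stream block of dilated 1-D convolutions. The layer sizes, kernel widths and dilation rates are illustrative only, not the configuration used in the thesis.

```python
import torch
import torch.nn as nn

class MultiStreamTDNNBlock(nn.Module):
    """A minimal multi-stream block of 1-D dilated convolutions (TDNN layers).

    Each stream convolves the input over time with its own dilation rate,
    and the stream outputs are concatenated along the channel dimension.
    """

    def __init__(self, in_dim=80, stream_dim=256, dilations=(1, 2, 3)):
        super().__init__()
        self.streams = nn.ModuleList([
            nn.Sequential(
                # A TDNN layer is a 1-D convolution over time with dilation.
                nn.Conv1d(in_dim, stream_dim, kernel_size=3,
                          dilation=d, padding=d),
                nn.ReLU(),
                nn.BatchNorm1d(stream_dim),
            )
            for d in dilations
        ])

    def forward(self, feats):
        # feats: (batch, feature_dim, time), e.g. filterbank frames.
        return torch.cat([stream(feats) for stream in self.streams], dim=1)

# Example: a batch of 4 utterances, 80-dim features, 200 frames each.
block = MultiStreamTDNNBlock()
out = block(torch.randn(4, 80, 200))
print(out.shape)  # torch.Size([4, 768, 200])
```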
The second method we proposed was to develop a cross-domain model. Usually, lyrics transcribers are trained on either a cappella singing or polyphonic recordings, specifically the DAMP or DALI datasets. As a simple but effective approach, we merge both of these datasets and train a single model to obtain the cross-domain model. And finally, we propose music-informed silence modeling. Usually in speech recognition there is a set of phonemes used as target classes, which you can see on the right for the English language, and in addition a silence phoneme is inserted into the set of classes to represent non-vocal instances. In addition to this, we insert another silence phoneme, which we refer to as the music phoneme. We embed this information explicitly prior to training by tagging the lyrics, including these silence tokens at the beginning and at the end of each lyrics line, and we exploit the fact that we know the DAMP data is only a cappella and the DALI data is only polyphonic. Both method two and method three led to considerable improvements: we were not only able to maintain performance across different domains, we were also able to achieve performance improvements. On the other hand, music-informed silence tagging, or silence modeling, was mostly effective for the polyphonic case but not so much for the a cappella case. So again, the lower the error rate, the better.
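As an illustration of the tagging just described, here is a minimal sketch. The token names and the per-line format are hypothetical placeholders, not the exact symbols used in the training pipeline.

```python
# Hypothetical non-vocal tokens; the actual symbols used in training may differ.
SILENCE_TOKEN = "<sil>"    # non-vocal regions in a cappella recordings
MUSIC_TOKEN = "<music>"    # instrumental regions in polyphonic recordings

def tag_lyrics_line(line, domain):
    """Wrap one lyrics line with a domain-dependent non-vocal token.

    domain: "acappella" for DAMP-style data, "polyphonic" for DALI-style data.
    """
    token = SILENCE_TOKEN if domain == "acappella" else MUSIC_TOKEN
    return f"{token} {line.strip()} {token}"

print(tag_lyrics_line("hello darkness my old friend", "polyphonic"))
# <music> hello darkness my old friend <music>
```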
For the final results in this paper, we combined our MSTRE-Net with the other methods we proposed. Note that in this paper we also introduced a new dataset for benchmarking polyphonic lyrics transcription; it is a subset of the DALI dataset whose annotations have been manually verified, and it is the largest evaluation set for lyrics transcription, with the highest musical variability, while maintaining a balance of singers' genders. This data can be retrieved via the tutorial we provide at this link. To summarize the results in this table: if you take a look at the two bottom lines, the second line from the bottom is the state of the art reported in the literature and the bottom line is the result we reported in that paper. As you see, especially for the polyphonic case, that is a substantial improvement, and this is now the state of the art by a large margin. For your information, the other state of the art, for monophonic recordings, comes from our previous publications.

So, our last research question: DNN-HMM or end-to-end models, which is more suitable for this task? For the DNN-HMM model we used our MSTRE-Net, and for the end-to-end model we used a pre-trained acoustic model and language model trained on the large benchmark speech corpus, LibriSpeech. We fine-tuned both the acoustic and language models on the DAMP dataset, using the transformer architecture. The best results for both of these methods are comparable, although the DNN-HMM model is still better. The best results for the DNN-HMM approach were achieved by training a model solely on singing data, whereas the best results for end-to-end models were achieved through transfer learning, or fine-tuning: first we need a very good model trained on a large speech dataset, which we then fine-tune on singing. Note that transfer learning did not work particularly well for DNN-HMM models; on the contrary, the end-to-end model trained solely on singing data performs horrendously. It's an interesting observation.

So I'd like to ask an A or B question for everyone: which do you think is more suitable for this task? Maybe just raise a hand on Zoom. Do you think DNN-HMMs are more suitable for lyrics transcription, or end-to-end models? Anyway, it's a really hard question to answer, but let me give my input on this and go through the pros and cons of each method. It appears that DNN-HMM models are suitable for the low-resource scenario. They come with better performance, they have multiple independent blocks that can be individually optimized, and we have tested this model on real-world data. On the other hand, neural networks are not fully exploited, and there are lots of complicated data processing steps such as alignments, building phonetic context trees, etc. Also, having multiple independent building blocks is not necessarily a good thing, because optimizing a single block does not necessarily lead to a global performance improvement. For this reason, tuning is difficult and this method requires a strong theoretical background. For end-to-end models, we don't need a pronunciation model, which is the feature that makes them most attractive. They are easier to train, minimal data processing is involved, neural networks are exploited to their full extent, and they are more modern and flexible. On the other hand, they require very large training datasets to achieve good performance, they have high memory requirements on GPUs, and we have not tested this system on real-world data yet.

To summarize our contributions: we were able to achieve the state of the art in ALT, automatic lyrics transcription, across all benchmark datasets, and for polyphonic recordings by a large margin in particular. We proposed a number of novel training approaches and novel neural network architectures. We introduced a new evaluation set, which is the largest in this research area, and through these contributions we are benchmarking lyrics transcription research for both DNN-HMM and end-to-end based approaches.

So I'm going to show a few examples. The first one is a monophonic example, and I hope everything goes well with the audio settings. So, it looks pretty good. It was actually 100% accurate in this case, and as you see it was able to model longer vowels, so we were actually quite happy when we first saw this result. Now I'm going to show some polyphonic examples, but bear in mind that the examples I'm going to show do not necessarily represent my music taste. Let's have a listen to those. As you see, the upper row is the ground truth and the lower row is the hypothesis, with the wrong words highlighted in red. Here is another one, a hip-hop song. This result is impressive. That's good. In this next example, I'd like you to focus on the very end of the sample, and I guess you'll have an idea of what I'm talking about. Our lyrics transcriber was attempting to transcribe the percussive sounds, which is unintended, and that's not my PhD thesis.

So obviously we don't claim that we have solved this problem; we haven't solved automatic lyrics transcription. There is still large room for improvement, and in our opinion, based on our previous experience, performance improvements can be achieved through data augmentation, or by exploiting the available data to a greater extent through semi-supervised learning or representation learning methods. We also have to inject more musical priors into this whole pipeline, such as leveraging rhyming information, and multilingual lyrics transcription is an interesting research direction; having a multilingual transcriber would actually be commercially impactful. There are also still music-specific challenging cases, such as brutal vocals in death metal. I actually tested my system on this and it fails completely. Opera singing is another challenge, and we also have a lot of custom words in lyrics that we do not see frequently in natural speech. I'll give an example of that, such as this sample here. It's a little bit ambiguous how to model such instances, but that's also a future challenge.
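Since the demos above compare a hypothesis line against the ground-truth lyrics, here is a minimal sketch of how the word error rate used throughout these results can be computed as a normalized word-level edit distance. The example strings are made up.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length,
    computed with Levenshtein distance over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dist[i][j] = edit distance between first i reference and first j hypothesis words.
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i
    for j in range(len(hyp) + 1):
        dist[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # substitution or match
    return dist[len(ref)][len(hyp)] / len(ref)

# Toy example: one substituted word out of six -> WER of about 0.17.
print(word_error_rate("the sun will rise again tomorrow",
                      "the son will rise again tomorrow"))
```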
As for our research outcomes, within the context of lyrics transcription at least, we were able to publish five peer-reviewed papers at top conferences. All of our lyrics transcription training software is open source and on GitHub. We have introduced a new evaluation set, again the largest in this research area so far, and through our collaboration with Doremir we were able to develop the first commercial application of automatic lyrics transcription technology, which will be released in early 2022. Maybe you can take a screenshot and then copy and paste that YouTube link. That's my email address, and these are the references. Thanks for your attention.
