Speaker 1: So, this is Emir Demirel. I'm doing a PhD at the Centre for Digital Music at Queen Mary University of London, working in a team with Professor Simon Dixon and Doremir Music Research AB from Sweden. My research focuses on automatic lyrics transcription from monophonic recordings, as you see, which I'll be talking about in the next few minutes. So, here's the content: I will first introduce the concept of lyrics transcription and go through the basics of it. Then I will go through my own contributions in this field. Then we will conclude with final remarks.

So, automatic lyrics transcription is the process of recognizing words and phonemes from singing voice signals. Basically, it can be perceived as speech recognition from musical performances, musical recordings. But even the strongest speech recognizers fail to perform robustly when it comes to singing. For instance, in this slide, what you see in the blue box is what YouTube's own recognizer recognizes from this singing voice performance. However, below, you see the original lyrics. So, there's definitely room for improvement there, considering the domain-specific properties of singing compared to speech. Our research tries to tackle this problem and bridge the gap between speech recognition and lyrics transcription.

To give a little bit of context, I'd like to briefly mention the basics of speech recognition systems. The equation you see on the right-hand side is the main equation of speech recognition, which means finding the most probable word sequence given the acoustic observation X. Traditional speech recognition systems have three major components. The first one is the acoustic model, and the probability you see in the red box on the right-hand side models that; it tells us the likelihood of observing such an acoustic instance knowing that it belongs to a certain word sequence. Then we have the language model, which is the P(W) here, which tells us the likelihood of a word occurring in a word sequence or in a sentence. This probability is statistically obtained through N-gram language models, which are generally trained on a text corpus. In the context of lyrics transcription, since we are interested in lyrics, literally, we curate our text corpus from song lyrics only. And finally, we have the pronunciation model. In speech recognition, words are considered to consist of phonemes, which are the basic sonic units of speech. The mapping between phoneme sequences and words is defined by a phonetic lexicon, which usually has the form you see on the left-hand side. Note that by allowing multiple phoneme sequences for a certain word, we make alternative pronunciations of a word possible.

So these are the basics, and now I'd like to move on with our own contributions in lyrics transcription. In our latest work, we have focused on constructing a strong acoustic model for lyrics transcription. We have presented the neural network architecture that you can see on the right. The main component of this architecture is the time delay layers, denoted as TDNN-F in the middle, which are basically 1D dilated convolution operations over time. They are proven to perform well and to model long-term context dependencies well, and as opposed to RNNs, they are parallelizable, so they are computationally less expensive.
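To make the time delay idea concrete, here is a minimal sketch of such a layer as a dilated 1D convolution over the time axis, assuming PyTorch. This is not the authors' implementation: the feature dimensions, kernel width, and dilation factor are illustrative placeholders, and the factorized bottleneck that gives TDNN-F its name is omitted.

```python
# Minimal sketch of a time-delay (TDNN-style) layer: a dilated 1D convolution over time.
# Illustrative only -- not the authors' code; sizes and dilation are placeholder values.
import torch
import torch.nn as nn


class TimeDelayLayer(nn.Module):
    """Dilated 1D convolution over frames, followed by ReLU and batch normalization."""

    def __init__(self, in_dim: int, out_dim: int, context: int = 3, dilation: int = 2):
        super().__init__()
        # Conv1d expects (batch, channels, time); the channel axis holds the acoustic features.
        self.conv = nn.Conv1d(in_dim, out_dim, kernel_size=context, dilation=dilation)
        self.relu = nn.ReLU()
        self.norm = nn.BatchNorm1d(out_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, in_dim, num_frames). Stacking such layers with growing dilation
        # widens the temporal receptive field without recurrence, so all frames are
        # processed in parallel -- the property contrasted with RNNs above.
        return self.norm(self.relu(self.conv(x)))


# Example shapes: 40-dimensional filterbank features over 200 frames.
feats = torch.randn(8, 40, 200)                # (batch, feature_dim, time)
layer = TimeDelayLayer(in_dim=40, out_dim=512)
out = layer(feats)                             # (8, 512, 196): edge frames lost to the context window
```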
However, they still suffer from high dimensionality. For this reason, to provide compact and more robust features for these time delay layers, we have added six 2D convolution layers at the front end, with subsampling after every other layer. And finally, to refine the context when making predictions and also to give a time constraint on the predictions, we have added a self-attention layer on top of our neural network. We have chosen the maximum mutual information criterion as our objective function, which aims to maximize the shared information between the reference and the target sequence. With this system, we were able to achieve a 33% relative word error rate improvement compared to the previous best system in monophonic lyrics transcription.

Here is one output example of our system. The green horizontal bars are the phoneme segments, and the vertical white lines indicate the word segments. I also added the pitch curves, which you can see as the discrete blue curves, to emphasize that even under drastic pitch changes, the phoneme and word segmentation achieves good performance.

One major challenge in lyrics transcription is the pronunciation differences that arise when people utter words in singing. To be able to understand these differences, we recently conducted a computational analysis of pronunciation variation during singing. One of the first observations we had is that vowels are pronounced longer during singing. Further, we applied a comparative confusion analysis based on phoneme types, and we have seen that people interchange phonemes when they are uttering words during singing. This happens especially with vowels, but we also see that singers tend to omit plosives, sounds like B, D, G, P, K, T. We also see that they interchange fricatives with each other. Further, we see that the temporal evolution of pitch curves may change from performance to performance or from singer to singer. For instance, in the picture on the right, we see a more oscillating curve; in music information retrieval, we would call this a vibrato-like pattern, whereas on the left-hand side it's a more stable pitch curve, even though both performances aim to sing the same words and the same melody. These aspects of pitch curves obviously have an effect on how words are pronounced and also on the lyrics transcription system. When attempting to recognize and transcribe words from singing using a speech-based pronunciation model, we may end up with funny errors, as you can see in the red box on the slide. For this reason, we need to be careful and consider the domain-specific properties of singing.

So, let's do a recap. I'd like you to have these take-home messages from this talk. What we have learned is that lyrics transcription is the process of recognizing words and phonemes from singing voice signals. Traditional automatic speech recognition systems have three major components: the acoustic, the language, and the pronunciation models. And finally, to achieve a better lyrics transcription system, speech recognition systems need to be adapted by means of all of these components. Especially, we need to consider the temporal evolution of pitch and the phonetic content. I'm Emir Demirel, and you can reach my work through the GitHub link you can see, or you can directly email me. Thanks for your attention.
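Since the decoding equation referred to earlier appears only on the slides and not in the transcript, here is the standard formulation the speaker describes, reconstructed rather than copied from the slide, with X the acoustic observation and W a candidate word sequence:

```latex
\hat{W} = \arg\max_{W} P(W \mid X)
        = \arg\max_{W} \underbrace{P(X \mid W)}_{\text{acoustic model}}\,
                       \underbrace{P(W)}_{\text{language model}}
```

The pronunciation lexicon supplies the mapping from each candidate word sequence W to phoneme sequences, so that the acoustic model can be evaluated at the phoneme level.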