Speaker 1: It's my pleasure today to introduce Dr. Emmanuel Beneteau, who is a senior lecturer at Queen Mary University of London. And I'm very pleased that he has agreed to give a tutorial on a very interesting topic called Automatic Music Transcription. And I think, since everybody registered, I expect you to have read the announcement, including his bio. I don't want to waste time, so we can save some time there, and then actually, please take as long as you want.
Speaker 2: OK. Thanks very much. And also, thanks to Professor Wang Yi, who organized this research visit and this tutorial. And thank you all very much for coming. I hope this will be an interesting three hours. And sorry for the slight delay. There were some audio-visual problems that we had to address. Before I start, can I have a quick raise of hands? How many of you are from computer science? OK. So there's a good bunch of people who are not from computer science. That's great. Anyone from a music background? Great. Anyone not from the US? OK. That's excellent. So I'm very pleased that there's a diverse audience. So for this tutorial, I guess that there will be some computer science-y stuff being presented. And if this doesn't make sense, please feel welcome to interrupt me and ask any questions you might have. Likewise, there might be some things which are about music or musicology. Likewise, if you don't understand some of the terms I'm using, please feel free to interrupt me. Generally, this tutorial is mostly meant for people who have some computing or digital expertise and some familiarity with music, though maybe not formal training. That's the idea, anyway. So for this tutorial, I've also set up a small website, which is at the URL there. Feel free to go there. You can download a small PDF with the handouts of these slides. And also throughout this tutorial, I'm going to be referencing some papers. You'll see sometimes some citations with an author's last name and a year. It might be a bit puzzling, but if you go to this website, there's also a bibliography file, which has all the complete references that I have been using, if you want to dig deeper into some of the topics I will be presenting. Also, on this website, you'll find some links for downloading Sonic Visualiser, which is a tool I will hopefully showcase in the second half of the tutorial, as well as some music examples that I will be demoing also, hopefully, in the second half of the tutorial. This tutorial is part of an ongoing process. I have been working on this topic for a few years now. And this came into being as a collection of works that I have done, along with other people. Simon Dixon, who's at Queen Mary. Ansi Klappuri, who's a musician in Finland. Jia Duan, who's at Rochester University in the U.S. Rodrigo Schramm, who's at the Federal University of Rio Grande do Sul in Brazil. So, it's quite a diverse group of people who have been working along with me on this topic. And so, this tutorial is a collection of our combined views on this matter. So, what are we going to do for the next three hours or so? So, there's going to be an introduction to the topic. And then, I want to spend some time, before going into the automatic part, talking a bit about how humans transcribe music. So, this is more related, maybe, to what we could call music cognition, the field of music psychology, music perception, and so on. And then, I want to go into the meat of the presentation, which is Section 3, which is the state-of-the-art research in automatic music transcription. Then, some practical information about data sets and ways to evaluate these things, and how this problem also relates to other problems within the wider field of sound and music computing that Professor Wang Yi is working on. There will be a small demo, as well. And then, another big chunk of the tutorial will be on more advanced topics in music transcription. Again, please do interrupt me at any point.
I'm really looking forward to hearing some good questions and some views on the subject. This is meant to be a slightly critical discussion about transcription. So, it's not meant to be, sort of, a definitive account; everything that I say will, hopefully, come with a critical perspective. And I also encourage you to be critical, too. So, what exactly is music transcription, then? So, when we say music transcription, without the automatic part in the beginning, we usually refer to having a human subject, a music transcriber, who would either write something down, maybe in staff notation, music notation, just by listening to it, or could, maybe, create an arrangement of another composition. So, they have a score in front of them, and they rearrange it into another score. Or, they listen to a music performance or a music recording, and then they write down the score. So, what is automatic transcription? It's a process for automatically converting an audio recording into some form of musical notation. And when I say some form, it could be human-readable notation. So, maybe something like that. Staff notation, Western staff notation. But it could also be machine-readable notation, in the form of a MIDI file, or more advanced machine-readable notation formats, such as MusicXML or MEI, which are now widely used in the digital music community. The process for coming from this part here, which is a waveform, to this thing here at the bottom, which is the actual score, is a tricky one. And there are only a few, a handful of methods that directly go from here to there. Usually, we have to do something in between. We have to compute, maybe, an intermediate representation, which identifies some core objects of music. This could be notes, pitches, onsets and offsets, that is, when a note starts and when a note ends: the onset is the starting point of the note, the offset is the end of the note. Streams, or voices. You might have many instruments in your music recording. You might need to identify them and follow them over time. Loudness, which is also related to estimating dynamics. And then we have also other aspects, such as expression, articulation, and so on. And then, after we have identified all of these from the audio signal, the waveform, then we need to put them all together somehow, in a way that could potentially be read by humans, and that is essentially what we call typesetting or engraving. It also implies, as well, that in the waveform, we refer to things in terms of time, in terms of seconds or milliseconds. When we have a music score such as this one, we refer to components not in terms of seconds, but in terms of maybe beats, downbeats, or beat subdivisions. So, it's a different sort of temporal resolution, in a sense. So, we move from the time level here to the beat level in the final transcription. So, why do we care about this? So, it is actually considered to be a fundamental and still quite open problem in the field of music information research. You might have come across this field under the acronym MIR, with a slightly different expansion, Music Information Retrieval. And Professor Wang Yi also sometimes refers to it as Sound and Music Computing, and I think some of you are, in fact, students from the Professor's Sound and Music Computing class. The field of music information research essentially aims to make sense of music data.
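To make the intermediate representation just described a little more concrete, here is a minimal sketch in Python of a list of note events and a conversion from seconds to beats. The constant tempo and the sixteenth-note quantization grid are illustrative assumptions, not part of any particular transcription system.

```python
# A minimal sketch of an intermediate note-event representation: pitch,
# onset/offset in seconds, and a rough loudness value, plus a conversion
# from seconds to beats assuming a constant tempo. A real system would
# need beat tracking to handle tempo changes.
from dataclasses import dataclass

@dataclass
class NoteEvent:
    pitch: int          # MIDI pitch number, e.g. 60 = middle C
    onset_sec: float    # note start, in seconds
    offset_sec: float   # note end, in seconds
    velocity: int = 80  # rough stand-in for loudness/dynamics

def seconds_to_beats(t_sec: float, tempo_bpm: float = 120.0) -> float:
    """Map a time in seconds onto the beat grid, assuming a constant tempo."""
    return t_sec * tempo_bpm / 60.0

notes = [NoteEvent(60, 0.00, 0.48), NoteEvent(64, 0.50, 0.97)]
for n in notes:
    onset_beats = round(seconds_to_beats(n.onset_sec) * 4) / 4  # quantize to sixteenths
    print(n.pitch, onset_beats)
```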
And when I say music data, this could be audio, this could be scores, it could be images in the sense of digitized manuscripts, it could be video recordings from music performances, it could even be gestural data from performers. And for transcription, there are many applications in this field of music information research. For example, in searching and annotating music information automatically. Automatic music transcription technologies are used, for example, in all state-of-the-art systems for music similarity and recommendation. Anything that you might see when you open Spotify or iTunes, or Apple or Google Play, all these kinds of things. In the back end, they have some sort of transcription component there. Also, interactive music systems. So, it is quite widely used in the sense that maybe we want to create a system that does automatic music accompaniment. Maybe a music-minus-one kind of thing. So, we perform and then the system is trying to identify what we are doing and follows along with us. Music education. It is being used for automatic instrument tutoring. Here in the US, actually, there has been quite a lot of research in this area. Music production. Ways of mixing, mastering, putting together music tracks in order to create new recordings. And then, finally, also digital or computational musicology. So, the scholarly study of music itself. And music transcription has maybe proven to be somewhat of a breakthrough in going beyond the study of manuscripts and scores towards the study of audio recordings. And it is widely used nowadays in the area of systematic musicology or digital musicology. And especially also in the field of ethnomusicology, where there is usually no score at hand. Automatic music transcription as a task: it is one of the many tasks found in music information research. And there are also quite a lot of connections with other tasks in music information research. For example, source separation. Let's say the problem of source separation is when we have as input an audio recording with multiple sound sources. Could be many instruments. Could be instruments and singing voice. And we want to separate those waveforms. And the problem of recognizing the notes and instruments is, of course, quite close to that of separating them. As is also score following. Score following is a problem where we have access to an audio performance, could be a real-time audio performance, and we also have access to a score. And we want to link, to align, the performance with the score itself. Maybe for automatic page turning, for instance. Structural segmentation. So, this is the case, for example, where we... So, this is a wired microphone, so I'll probably trip over at some point. Where we... Let's say we have a music piece. And we want to identify different segments within the piece. Maybe this is the A part. Maybe this, B, is the chorus part. A, B, A. Something like that. And, of course, doing this thing from audio automatically is quite challenging. But if you do that from the score, then things get much simpler in that case. Cover song detection as well. For many songs nowadays, especially on YouTube, we see a lot of cover versions. And identifying a cover song just from audio is quite challenging. But if you simply compare the two transcriptions, the scores, then it's actually quite straightforward to judge if this particular YouTube clip is actually a cover of an existing song.
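As a small aside on the cover-song point above: once two recordings have been reduced to symbolic note sequences, comparing them becomes straightforward. The melodies and the interval-based comparison below are toy examples, not an actual cover-detection system, which would also have to handle tempo, structure, and polyphony.

```python
# Toy illustration: compare two melodies by their pitch intervals, so a
# transposed cover still matches the original. Real cover-song systems
# are far more elaborate, but the symbolic comparison itself is simple.
from difflib import SequenceMatcher

original = [60, 62, 64, 65, 67, 65, 64, 62]  # MIDI pitches of an original melody
cover = [62, 64, 66, 67, 69, 67, 66, 64]     # the same melody, transposed up a tone

def melodic_similarity(a, b):
    ia = [y - x for x, y in zip(a, a[1:])]   # interval sequence of melody a
    ib = [y - x for x, y in zip(b, b[1:])]   # interval sequence of melody b
    return SequenceMatcher(None, ia, ib).ratio()

print(melodic_similarity(original, cover))   # 1.0: identical interval sequences
```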
And maybe more generally, automatic music transcription provides a link between the audio domain, what we would call music signal processing, and the symbolic domain, the music notation domain, or the music language domain, as it's called, or symbolic music processing. Beyond music itself, automatic music transcription is related to quite a few tasks in maybe the wider area of multimedia or multimodal computing, if you want to call it that. For instance, automatic speech recognition. So that's a task where basically we have spoken voice as a waveform and we want to identify the words. This is quite similar. There are, of course, some differences in the sense that in speech, we usually have one word at a time. We have one phoneme at a time. One speaker at a time, maybe. Maybe we have more speakers. But these speakers would probably be independent from each other. Whereas in music, you can have many sources performing concurrently, you can have many notes being present concurrently, which complicates the problem a bit. Another problem which is kind of emerging, and this was the topic of my seminar last Wednesday here in the US, is sound event detection, which is basically the problem of trying to identify specific sounds in any kind of environment. Again, this is quite similar in the sense that we are aiming to identify multiple concurrent sounds. Computer vision. Many of the methods that I will talk about today are in fact inspired by the computer vision literature. This is because the way that we usually view sounds is not through a waveform like this one, but through a spectrogram, which is a two-dimensional representation, quite similar, although not identical, to an image. So we use a lot of image processing methods in order to make sense of sounds. And finally, natural language processing. Anything related to text. So quite often in text, and also in speech recognition, we have a sequence of words, and we want to predict the next word. Or we have a sentence or a piece of text, and we want to perform syntactic analysis or grammatical analysis to try to understand the text. Similar methods are also used in music transcription, where we have a sequence of notes, and we want to predict what will come afterwards. And this could potentially improve our models. And of course, we also have the added problem in music that, as opposed to speech, where we always have one word followed by another and no concurrent words, in music we have many concurrent notes, which complicates the problem a bit. At the end of the day, music transcription is not just about academic research, the research I presented. There's a lot of commercial interest. There's both social and economic impact to be found in music transcription. This is just a small collection of software tools that are commercially available. This is, for example, Sibelius, which has a music transcription plugin. Sibelius, for those who don't know, is one of the more commonly used music notation editors. This is ScoreCloud, one of whose key features is to let its users automatically transcribe their acoustic performances. And Melodyne, which might be another example of that... it doesn't produce staff notation in the sense that ScoreCloud might provide. It provides a sort of intermediate notation that captures maybe aspects of the music performance as well as aspects of the score itself.
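Going back to the music-language-model idea mentioned above, predicting which note comes next, here is a toy bigram predictor. The short melody and the counting scheme are purely illustrative; actual systems use much richer sequence models.

```python
# Toy music language model: count which pitch tends to follow which in a
# melody, then predict the most likely continuation.
from collections import Counter, defaultdict

melody = [60, 62, 64, 65, 64, 65, 67, 65, 64, 65, 64, 62, 60]  # toy MIDI pitches

bigrams = defaultdict(Counter)
for prev, nxt in zip(melody, melody[1:]):
    bigrams[prev][nxt] += 1

def predict_next(pitch):
    """Most frequent successor of `pitch` seen in the toy melody."""
    return bigrams[pitch].most_common(1)[0][0] if bigrams[pitch] else None

print(predict_next(64))  # 65, which followed 64 three times out of four
```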
Another maybe key component in the discussion of automatic transcription is the difference between what we might call prescriptive notation versus descriptive notation. In a nutshell, we could call descriptive notation maybe something like this, where we have some sort of representation that tries to identify what is happening in a music performance. And then we have prescriptive notation, so let's say a recipe, for instance, which essentially doesn't capture the details or the intricacies of a music performance. It provides a recipe for other people to maybe try to reproduce a composition, possibly resulting in a different performance. The problem of automatic music transcription is divided into quite a few tasks. It's quite a holistic task by itself, and it's divided into several subtasks. One thing, and maybe the main thing we have to do towards coming up with a good transcription, is to identify the notes, the pitches. We sometimes call that pitch detection, or multi-pitch detection to indicate the presence of multiple concurrent pitches. For every note, we have to identify its start and end position, both in terms of time, but also quantized in terms of our beats, our rhythm, our meter. We need to identify the instruments, and also, even more challenging, we need to identify, for every single note, which instrument it belongs to. We need to identify elements of rhythm and meter. We need to identify dynamics and expression, and then we need to put everything together in order to come up with a final score, at least in staff notation. Some applications of transcription don't require this level of notation. They might simply require a MIDI file in terms of performance notation, so the timings would be in terms of seconds, or it could be a more complex representation, such as MEI or MusicXML, where we do need to temporally quantize the music recording that we have. And the core problem, as I mentioned before, is multi-pitch detection. So this is a short segment of one of Bach's preludes. ♪ So what do we observe here? This is piano music. We have two notes played concurrently at a time, and these notes are actually correlated with respect to rules from music theory. If we try to visualize the audio recording in terms of a spectrogram, which is this representation over frequency and time, so kind of like an image, we have this grayscale image in the back. What we see is a blurred sort of thing with a lot of frequencies. So every straight gray line here represents a specific harmonic or frequency over time. There is a lot of noise in the low frequencies here, because the recording quality of the clip I played was not very good, and also because the piano is a pitched percussive instrument: you have hammers hitting the strings, which also creates some noise, some transients. And the challenge is, based on this blurry representation, to try to identify these red squares here, the red rectangles there, which are the pitches we want to identify, with a start and an end time for each pitch. And this is, in a nutshell, the challenging aspect of multi-pitch detection. So how difficult is it? I want you to sing now. So let's listen to a piece and try to transcribe or hum the different tracks. So I'm going to play that piece now, as loud as I can. I know it's not too loud here. I'm going to play that again. Okay, so who is now confident enough, I will help as well, to try to hum maybe the top melody. Okay, I'm going to play the piece and then we'll start together.
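As an aside on the spectrogram and red-rectangle picture described above, here is a minimal sketch relating the two representations: the spectrogram we analyse and the rectangles, that is, a piano roll, that we would like to recover. The audio file name, hop size, and note list are placeholder assumptions, and the librosa library is assumed to be available; multi-pitch detection is, in essence, the problem of estimating the second array from the first.

```python
# Sketch: a time-frequency spectrogram alongside a time-pitch piano roll
# built at the same frame rate. Multi-pitch detection means estimating
# the piano roll given only the spectrogram.
import numpy as np
import librosa

y, sr = librosa.load("bach_prelude.wav", sr=None)        # placeholder recording
hop = 512
S = np.abs(librosa.stft(y, n_fft=2048, hop_length=hop))  # shape: (freq bins, frames)
n_frames = S.shape[1]

def piano_roll(notes, n_frames, frame_sec):
    """notes: (midi_pitch, onset_sec, offset_sec) tuples -> 128 x n_frames grid."""
    roll = np.zeros((128, n_frames), dtype=np.int8)
    for pitch, onset, offset in notes:
        start = int(round(onset / frame_sec))
        end = min(n_frames, max(start + 1, int(round(offset / frame_sec))))
        roll[pitch, start:end] = 1
    return roll

ground_truth = [(60, 0.0, 0.45), (64, 0.0, 0.45)]        # placeholder annotations
roll = piano_roll(ground_truth, n_frames, frame_sec=hop / sr)
print(S.shape, roll.shape)  # two time-aligned 2-D arrays
```

Because the two arrays share the same time axis, a transcription system can be trained and evaluated frame by frame against such a piano roll.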
Great. Okay, you got it definitely. So yeah, if I play back a MIDI file of the top instrument. Yep, so okay, we got that. How about the rest? That's now becoming a bit more challenging. I'm going to play that again and I'm not going to ask you to hum, I'm going to ask you to listen now. Okay, let's play that again and try to hum the middle voice. I see less participation this time. That kind of makes sense. Okay, let's listen to the piece again, no humming. Try to follow that low voice now. Oops, sorry. Okay, so let's try to hum the low voice now. So, what do they sound like, actually? That's the middle voice. And the lower one. Yep. So, yeah, it is not really easy to do this kind of thing, I guess. Maybe for, let's say, the average, not too musically trained person, it's relatively easy to hum the melody. But when it comes to the lower voices and the middle voices, then things become a bit more tricky. And often, in order to master the skill of transcription, quite a few years of musical training are required. So, the challenge is, if we find it so difficult to do this ourselves, how can we make a machine do that kind of thing? Sorry. Sorry. So, however, there are some maybe more gifted people that have this kind of skill. So, there was this anecdote about Mozart, that allegedly he was able to recreate a complex performance entirely from memory just by listening to it once. And can we compete with that? That's the question. We're not quite there yet. So, what are the challenges in doing this automatic transcription then? So, as I said before, in speech recognition, which is a closely related topic, you usually have one word at a time, one phoneme at a time. In music, you have many concurrent notes, you have many concurrent sources, which complicates the problem. That's fine. But, you know, there are other fields like computer vision, for instance. In computer vision, usually you have techniques that can identify multiple objects within a video recording or an image. That's fine. What is the difference? What's the key challenge here in music? The key challenge is what we would call independence. In speech and in computer vision, you can have a room with many speakers talking at the same time. That's fine. But each person, each speaker is independent. The words that they say don't really match the words that the person next to them would be saying. The timings will not match. The content will not match. They would be talking independently. It's the same with objects: if I take a video recording of you like that, then each person is moving independently of the person next to them. Whereas in music, very often, almost every time, we are supposed to have sound events that occur together. We have notes from different instruments. They're supposed to be playing together at the same time. And the notes that they're performing, they might be exactly the same notes or they might be notes that are related somehow. They might be what we call harmonically related notes. So you might have one instrument playing a C, another instrument playing an E, and another instrument playing a G. These notes are not just random notes. Their combination is such that it forms a C major chord. And this is important. Because in music, compared to all these other modalities, we have what we could call, in a more mathematical sense, strongly correlated sound events. We do not have any sort of independence. And this is what complicates the problem.
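To put a number on the "harmonically related" point above: the harmonics of C, E, and G land on nearly the same frequencies, so in a spectrogram these notes overlap heavily. A small check, using equal-tempered fundamentals and only the first eight harmonics (both illustrative choices):

```python
# Find near-coincident harmonics among C4, E4 and G4 (a C major chord).
import itertools
from math import log2

fundamentals = {"C4": 261.63, "E4": 329.63, "G4": 392.00}  # equal temperament, Hz
harmonics = {name: [f0 * k for k in range(1, 9)] for name, f0 in fundamentals.items()}

def cents(f1, f2):
    """Distance between two frequencies in cents (100 cents = one semitone)."""
    return abs(1200 * log2(f1 / f2))

for (na, ha), (nb, hb) in itertools.combinations(harmonics.items(), 2):
    for i, fa in enumerate(ha, start=1):
        for j, fb in enumerate(hb, start=1):
            if cents(fa, fb) < 30:  # within roughly a third of a semitone
                print(f"{na} harmonic {i} ({fa:.1f} Hz) ~ {nb} harmonic {j} ({fb:.1f} Hz)")
```

For example, the third harmonic of C4 (about 785 Hz) and the second harmonic of G4 (784 Hz) are only a couple of cents apart, so their energy lands in essentially the same spectrogram bins.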
The other problem we have is that usually, when we are transcribing music automatically, we only have access to a recording that, in the best case scenario, is a stereo recording, which means we have two channels.