Speaker 2: So, thank you very much, first of all, for the invitation to come here, and for the opportunity to speak here for the first time; thanks especially to Andrew and Mark. So the first talk, as Mark said, is going to be probably the most technical of all, and I'm going to talk about my PhD and postdoc research on automatic music transcription. Before doing that, just to put everything in context, I'm going to talk a bit about where I come from, the Centre for Digital Music at Queen Mary University of London. It's a fairly large group, formed in 2003. Its main goal is to do world-leading research into digital technologies for understanding and innovation in music and audio. C4DM, as it's commonly known, has organised quite a few conferences over the years. With respect to teaching, we now have a Master's course in Sound and Music Computing, as well as a PhD Centre for Doctoral Training in Media and Arts Technology, and you might have come across some of the software released by the group, most notably Sonic Visualiser, which is a software framework for visualising and analysing music recordings. This video here shows the research areas that C4DM is covering at the moment. I'm based somewhere between the music informatics and the machine listening themes, but there is also quite a lot of research, for example, in audio engineering, so sound recording, augmented instruments and new musical instruments, music and human interaction, computational creativity and generative systems for music, mathematical models of music, led for example by Elaine Chew with her work on mathematics and music, and performance science. As I was telling Andrew before, the only bit we're missing is actually the music language side, the symbolic music processing bit. On funding, the main programmes that C4DM has at the moment are its platform grant, its Centre for Doctoral Training in Media and Arts Technology, as well as its programme grant on fusing audio and semantic technologies for intelligent music production and consumption. And these are a few of the beneficiaries and industry partners that the group is working with. So now, moving on to the actual topic of the talk, automatic music transcription. There have been quite a few definitions. The one I'm going to use is that it is the process of converting an acoustic music signal into some form of musical notation, which could be human-readable or machine-readable, like a MIDI file. And in order to do that, we often have to go through some sort of intermediate representation in order to understand, for example, which notes are present at which times in the audio recording: at least the indices of the notes and their start and end times. And it is a fundamental and still open problem in the field of music information retrieval, or MIR for short. It has quite a few applications: for example, for creating interactive music systems, if you want to do automatic accompaniment; in computational musicology, if you want to come up with a musicological analysis of audio recordings; and also for indexing sound collections or for retrieval and similarity purposes, where you need to come up with a symbolic representation first. And it can be divided into several subtasks. First and foremost is pitch detection, to detect the notes in the recording, or more correctly, multi-pitch detection, to detect multiple concurrent notes in an audio recording. Then onset and offset detection, to detect the starting and end point of each musical note. Then also identifying the instrument that produces each note.
And then to extract rhythmic information, and to identify dynamics or expressive markings. And finally, to put everything together, to typeset or engrave all that information into a human-readable staff notation. Most of the research nowadays in automatic music transcription comes from the perspective of computer science and electronic engineering, and it usually involves signal processing and machine learning methods in order to accomplish these tasks. So, for example, most of these transcription techniques use audio features, coming from the signal processing side of things, or methods from statistics and probability. I'm going to talk today about matrix decomposition methods, which are linked with signal processing, mathematics and machine learning, in a way. And more recently we've seen quite a few connectionist methods, so methods coming from the part of machine learning that deals with neural networks. These transcription methods are evaluated annually at the MIREX competition, which is held in conjunction with the ISMIR conference every year. There are two tasks that are used to evaluate systems that do automatic transcription. The first is the multiple F0 estimation task, which basically tries to estimate, at each time frame, which notes are present in the recording. And then we have a more perceptually meaningful note tracking task, which tries to identify each note in the recording with a start and an end. And there are still quite a lot of open challenges in automatic music transcription. First and foremost is that the performance of these automated methods is still clearly below that of a human expert. I would also like to note that even for a human expert, doing a music transcription is not a trivial task; you need quite a bit of training in order to come up with a good transcription. And the performance of these systems drops considerably in the case of multiple-instrument music, where we have many instruments playing at once, and also in the case of high polyphony, where we have many notes played concurrently. Another challenge is the fact that there is not a lot of data with which to do the task. It's really difficult to come up with annotated recordings: for each recording you need annotations for each specific note, and each recording might have hundreds or thousands of notes. For example, when I was doing my PhD, annotating a one-minute recording might have taken me three days. And there's also no unified methodology. A closely related task to automatic music transcription is ASR, automatic speech recognition. In the ASR field there is a sort of standardized methodology in terms of the techniques that are used: for example, you use features called MFCCs, or classifiers such as GMMs or HMMs, and things like that. There's no such thing in the music transcription world. So, moving on now to the matrix factorization methods part. There was a paper in Nature in 1999 by Lee and Seung proposing this method of so-called non-negative matrix factorization. They came up with an algorithm which could compute a low-rank decomposition of a non-negative matrix. Their main focus at the time was image processing. What they were trying to do was, assuming we have an input image, to decompose this image as a sort of sum of local parts. Let's say that we have an image of a face.
We'd like to decompose that image in terms of one eye, another eye, the nose, the mouth, and so on. So we would come up with a dictionary of these elements of a face which, when weighted and summed correctly, would approximate the original face. That was the main motivation. The big constraint in this method is the fact that the data is non-negative, which means that the model is purely additive: we just take the parts, add them with some weights, and come up with our solution. Over the years, this method has been applied to quite a lot of tasks across the data science field: detection, dimensionality reduction, clustering, classification, denoising, prediction, and many more. It has been applied to images, but also to text, to video, and to audio recordings. If we put a bit more maths into it, the NMF model assumes an input non-negative matrix V, and the goal is to approximate it as the product of two non-negative matrices, W and H. The idea is that the rank of this factorization, the r here, is small, in order to compress the original input. Various algorithms have been proposed to do that, for example using expectation-maximization from machine learning theory or gradient descent methods, and various cost functions have been proposed for measuring and minimizing the difference between the input and its approximation. But how about audio specifically? In 2003, Smaragdis and Brown came up with the idea that we could apply NMF to the spectrogram of an audio recording. Here in this figure you see the spectrogram of a short piano segment, which I'm about to play now. So basically, a spectrogram, for those not familiar with it, is a two-dimensional representation with frequency on the vertical axis and time on the horizontal axis. You can see clearly in the spectrogram these horizontal lines, which denote the fundamental frequency and the harmonics of each note present in the recording. What Smaragdis and Brown found was that when we apply NMF to this spectrogram, we come up with two matrices. One of them, very conveniently, contains a spectral signature for each individual note present in the recording; this particular recording has five notes, each represented here as a one-dimensional spectrum. The other matrix, the H matrix, contains the activations for each of the notes on the left: it tells us when each note is active, at which time points, over the duration of the recording. This is essentially like a raw transcription (a minimal code sketch of this kind of decomposition appears just after this passage). We can use that information to decide which note is active at which time frame and come up with a proper staff notation in the end. At roughly the same time, in the field of text mining, Thomas Hofmann came up with a technique he called Probabilistic Latent Semantic Indexing, PLSI, which was subsequently taken up by the vision people, who renamed it Probabilistic Latent Semantic Analysis, PLSA, and which was then further adapted by the audio people into a technique now called Probabilistic Latent Component Analysis, PLCA. In fact, what was discovered a few years later was that this PLSI method was essentially a probabilistic counterpart of the NMF non-negative matrix factorization method, which however offered a Bayesian framework that made it easy to combine and fuse it with other methods coming out of probabilistic machine learning and come up with more interesting models for decomposing matrices.
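To make the NMF-on-spectrogram idea above concrete, here is a minimal sketch in Python using the classic Lee-Seung multiplicative updates for the generalised Kullback-Leibler divergence. This is a generic illustration, not the system described in the talk; the audio file name is hypothetical, and librosa is assumed only as a convenient way to compute the spectrogram.

```python
# Minimal sketch: NMF of a magnitude spectrogram via Lee-Seung multiplicative
# updates (generalised KL divergence). Illustrative only, not the talk's system.
import numpy as np
import librosa

y, sr = librosa.load("piano_excerpt.wav", sr=None)        # hypothetical file name
V = np.abs(librosa.stft(y, n_fft=2048, hop_length=512))   # frequency x time
V = V + 1e-12                                              # avoid divisions by zero

rank = 5                                   # e.g. one component per expected note
F, T = V.shape
rng = np.random.default_rng(0)
W = rng.random((F, rank)) + 1e-3           # spectral templates (one per note)
H = rng.random((rank, T)) + 1e-3           # activations (when each note sounds)

for _ in range(200):
    WH = W @ H
    W *= ((V / WH) @ H.T) / H.sum(axis=1, keepdims=True).T
    WH = W @ H
    H *= (W.T @ (V / WH)) / W.sum(axis=0, keepdims=True).T

# Columns of W now resemble note spectra; rows of H form a rough "piano roll".
```

Thresholding the rows of H frame by frame then gives the kind of raw transcription described above.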
This is how the PLCA model is formulated in the audio domain. If we assume an input audio spectrogram, so a two-dimensional representation of frequency and time, then the PLCA method approximates it as a bivariate probability distribution over frequency and time, which is in turn decomposed into two main matrices. One is the probability of a frequency given a component, where a component can be a musical note, and the other is the activation probability, that is, which component is active at a given time frame; this corresponds to the H matrix in NMF (the standard two-factor formulation and its EM updates are sketched just after this passage). Essentially, this kind of model can decompose a spectrogram into a series of spectral bases, where each basis can correspond to a musical note, and into event activations, so when each note is active at a given time frame in the recording. This basic model has been used in quite a few applications in the field of machine listening: for example, for multi-pitch detection, for detecting concurrent notes, or for detecting acoustic events more generally in everyday or environmental sounds, and also for separating sources or musical instruments in recordings. Another extension that was proposed a few years later concerned convolutive models. So far the model I presented was linear: a matrix decomposed into a product of two matrices. We can also come up with convolutive models, which try to extract shifted structures out of non-negative data. Why is that interesting? It is interesting for music because, if you visualise the spectrogram of a music recording in the log-frequency domain, instead of the linear frequency domain that is mostly used, then you can observe that pitch shifts, changes in the musical note, appear as vertical shifts of a common pattern. So here, the spectrogram on the left is of a recording of a violin glissando. You see that basically everything rises a bit, then it fluctuates, and then it drops again. If we apply one of these convolutive methods, we can come up with a single pattern which represents the harmonic signature of that specific violin, and by shifting that vertically we come up with a distribution which tells us that we have a note that goes up, fluctuates a bit, so there is a vibrato, and then drops down again. We can actually use these convolutive models to come up with methods that can detect tuning, vibrati, tremoli, and other types of frequency modulation, or perform microtonal analysis in music. Now, this is the part about how these methods have been used in my research on multiple-instrument music transcription. The goal I have is basically to create a system that can be used for multiple-instrument automatic music transcription, in the sense that you don't know exactly what instruments are present in your recording, you don't have specifics about it, so it's a blind system, and it can cope with that. Another constraint I had was that I also wanted to express these frequency modulations or tuning changes through the shift-invariance property I was just telling you about. And I also wanted to incorporate some knowledge about music acoustics, in the sense that a musical note is not static: as it evolves over time, its spectral signature changes. So, for example, on the left figure here you can see the spectrogram of a piano note. This is a C1. As you might also hear, I'm going to play it again, the note evolves over time; it's not exactly the same throughout. So, in the beginning there is the strike of the hammer, and the sound is much more noisy, let's say.
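For reference, this is the standard two-factor PLCA formulation and its expectation-maximization updates mentioned above, written out in LaTeX. It is the generic textbook form (with V the observed magnitude spectrogram, f frequency, t time, and z the latent component), not the full shift-invariant, multi-instrument model described later in the talk.

```latex
% Two-factor PLCA: the normalised spectrogram is modelled as a joint distribution
P(f,t) = \sum_{z} P(f \mid z)\, P(z,t)

% E-step: posterior over components for each time-frequency bin
P(z \mid f,t) = \frac{P(f \mid z)\, P(z,t)}{\sum_{z'} P(f \mid z')\, P(z',t)}

% M-step: re-estimate spectral bases and activations from the data V_{f,t}
P(f \mid z) \propto \sum_{t} V_{f,t}\, P(z \mid f,t)
\qquad
P(z,t) \propto \sum_{f} V_{f,t}\, P(z \mid f,t)
```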
And then it moves on: we come to the so-called steady state, where you can hear a sort of harmonic sound, and then we come to the decay state, where some of the harmonics decay faster than others. So we can represent this note as a series of one-dimensional spectra, and we try to come up with rules that constrain this evolution from the attack to the sustain to the decay state for various musical instruments. So this is the diagram of the system. The idea is that it takes as input an audio recording and computes a time-frequency representation such as a spectrogram, which is then fed into the transcription model, which is based on matrix factorization methods and relies on a dictionary of pre-extracted note templates from various instruments across their whole note range. I think this particular image here shows a small collection of spectra for a piano, from note A0 to C8. The output of that model is then post-processed, finally resulting in a MIDI file. And this is the equation summing up the model, and it is the last equation of this talk, by the way. The idea is, again, that we have as input a log-frequency spectrogram, which is decomposed into a series of probability distributions. The big one here is a five-dimensional dictionary of spectral templates, for each note, for each instrument, for each tuning deviation, and for a specific sound state in the evolution of the note, so whether it is in the attack or the sustain state. Then we have another probability distribution which denotes the tuning at a particular time instant for a particular note in the recording, tuning with respect to 440 Hz equal-temperament tuning, by the way. Then we have another probability which is the contribution of each instrument in producing a note at a specific time instant. Then there is another probability, which is the main output of the model: when each note is active at each time frame. That is the main output of the multi-pitch model. And finally there is a probability expressing, for each note at each time frame, which sound state it is in. So basically, in the model, we have one input spectrogram, which is fixed, the dictionary is fixed, and we try to learn tuning, instruments, pitches, and sound states jointly. This can be done using an iterative expectation-maximization algorithm, and also by adding a few temporal constraints on the order of these sound states. In the end, the run times for the system are 1 to 2.5 times real-time using a CPU-based implementation. But because we're talking about matrices, the computations can be sped up considerably if we use GPUs, reaching roughly 0.3 times real-time. That means essentially that if I have, let's say, a one-minute recording, I can come up with a transcription in about a third of that time. And this particular method has also been evaluated, apart from on various private datasets, in the annual MIREX competition, and in two different years it came first in these evaluations. You can download the code for this method, as well as for a few other methods, from the URL below, and that also includes the code for the GPU-based implementation. And here are a few examples of how well this method works. On the top figure here, you see the so-called piano-roll representation for a piano recording, and this is the transcription. So I don't know if you can hear it, but basically the system is mostly able to detect the notes. It does miss a few notes, as you can see by comparing the two figures here.
But most of the core harmonic content is there. Some of the timings are a bit wobbly, but not terribly so. Another interesting output of this system is that it can be used to do a more detailed analysis of fine pitch content. So here is the output of the transcription system for a string quartet recording, and you can see that the system was able to detect all the vibrati produced by these bowed string instruments. And the final thing to note is that this system is also available as a so-called Vamp plugin, which is the plugin format used in the Sonic Visualiser framework. It can be used, for example, to do real-time transcription, or to plug into other systems for any sort of real-time music technology application; you can export an audio recording into a MIDI file, for example, and things like that. It can be downloaded from the URL there and runs on quite a few operating systems: Linux, macOS, Windows, and so on. Now, moving on to the topic of music language models and how they can be used to improve automatic transcription. Most multi-pitch detection and transcription methods so far have been using only acoustic information, that is, information from the audio recording alone. However, there is this hypothesis that we can improve the performance of these transcription systems if we also use prior knowledge from, let's say, the language of music, however we might be able to define that. In a way, that's similar to how speech recognition systems work: typically, they consist of an acoustic model, which models the audio signal, and a spoken language model, which tells us the probability that a word is present given the previous context. But there are quite a few obstacles to doing this sort of thing for music. One obstacle is, of course, how do you define a music language? It's not as well defined as, let's say, spoken language in a specific context. Another obstacle is the fact that there are no off-the-shelf methods for modelling polyphonic music. It's quite straightforward to model a spoken language, in the sense that there are quite a lot of models, for example n-grams, Markov chains, hidden Markov models, that can come up with a probability for a given word given the previous words, things like that. There is nothing as straightforward for the case where we have unconstrained polyphony, where we don't know how many concurrent notes exist in the recording. That is, until recently, when people like Boulanger-Lewandowski came up with deep learning-based methods for modelling polyphonic music using recurrent neural networks. One approach, done when I was back at City University, in collaboration with Queen Mary University of London, was to use one of these recurrent neural networks as a music language model and connect it with the PLCA-based automatic music transcription model, in the sense that we can use the music language model as prior information. Basically, we did the transcription, then we did a prediction step using the language model, and then we fed that prediction back into the transcription step. What we found was that there was a significant improvement, and the music language model was able to correct some of the mistakes made by the acoustic system, for example semitone errors or random notes appearing out of context.
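The following is a deliberately simplified sketch of the general idea of fusing frame-level acoustic note probabilities with a music-language-model prediction. It is not the model from the work described here: the "language model" below is a trivial persistence-based stand-in for a trained RNN, and all names, thresholds and the random test data are assumptions for illustration only.

```python
# Toy sketch: product-of-experts style fusion of acoustic note posteriors with
# a music language model prediction. The LM here is a trivial stand-in (notes
# tend to persist); a real system would use a trained RNN.
import numpy as np

def toy_language_model(prev_frame, smoothing=0.1):
    # Predict next-frame note probabilities from the previous frame.
    return np.clip(prev_frame * (1.0 - smoothing) + 0.05 * smoothing, 1e-6, 1.0)

def fuse_acoustic_and_language(acoustic_post, alpha=1.0, beta=0.5, threshold=0.2):
    # acoustic_post: (num_frames, 88) per-note probabilities from the acoustic model.
    num_frames, num_notes = acoustic_post.shape
    piano_roll = np.zeros((num_frames, num_notes), dtype=bool)
    prev = np.full(num_notes, 0.05)                  # uninformative first frame
    for t in range(num_frames):
        prior = toy_language_model(prev)
        score = (acoustic_post[t] ** alpha) * (prior ** beta)  # fuse the two models
        piano_roll[t] = score > threshold
        prev = acoustic_post[t]
    return piano_roll

rng = np.random.default_rng(1)
fake_posteriors = rng.random((100, 88))              # stand-in acoustic output
print(fuse_acoustic_and_language(fake_posteriors).sum(), "active note-frames")
```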
And this approach was taken further by replacing the acoustic model with a deep recurrent neural network, and also by coming up with a more principled way of merging the acoustic and the language information. The idea is that we have the observations from the acoustic model and we assume that these are generated by a music language model, with the language model connected over time through a Markov chain of latent variables. This hybrid acoustic and language scheme was also able to produce a significant improvement in terms of transcription accuracy. And in the final approach we came up with, the acoustic model was replaced by a deep convolutional neural network. The advantage of a convolutional neural network is that we could now take into account the temporal context in the acoustic signal, so the temporal evolution of a note, and join that with the language information. And the results improved when we integrated the acoustic model with the language model. The final part of this talk is about applications of transcription. Apart from the core application that I just presented, I have done a few collaborations. The first one was with Andre Holzapfel, who is a computational ethnomusicologist, on how we can use these methods to transcribe non-Western music. We came up with a system that disregards the assumption of 12-tone equal temperament and can transcribe microtonal music, and we applied that to Turkish makam music. Another application in the field of computational musicology, with Dan Tidhar and Simon Dixon, was to estimate temperament in harpsichord recordings. Temperament is essentially, let's say, a note-by-note tuning configuration, which can convey a certain mood; it was quite popular in early music, although nowadays it is less and less used. Using automatic transcription of harpsichord recordings, we were able to perform high-precision frequency estimation in order to estimate the temperament of these recordings. Another application, which Tillman will present later in more detail, is that these transcription-related features can be used to explore and visualise music collections. Through this Digital Music Lab system, transcription-related features were used to estimate tuning for various music collections, or to work out which notes are prominent in collections of recordings. A more recent paper, which will be presented a couple of weeks from now at the ISMIR conference, was again a collaboration with Andre Holzapfel, on combining automatic music transcription with beat tracking. One of the drawbacks we had so far is that the system is able to produce a representation, but that representation is not directly convertible into staff notation. By using beat information, we are now able to quantize the notes detected by the system and come up with some sort of staff notation that can be interpreted by humans. This also raises another interesting problem, which is future work: how to compare an automatically produced staff notation with a manually produced one. And a final application that goes beyond music, and relates to the other half of my research nowadays, is that this transcription-based method was successfully ported to another application area, namely sound event detection.
That is, creating systems that can recognise various types of environmental sounds. You see here, for example, a spectrogram with system outputs for detecting various types of office sounds, like speech, door knocks, door slams and so on. So these kinds of matrix decomposition methods can go beyond music and, as also came up during lunch, they can be used to detect repeated patterns in everyday sounds. So, to sum up, I hope I was able to convince you that automatic music transcription is a sort of enabling technology in the field of music information retrieval, and that these matrix decomposition methods can lead to systems that are interpretable, that can easily be extended, and that are computationally efficient. There are quite a lot of emerging applications of automatic transcription in fields such as acoustics, instrumentation and measurement, and also in performance science and music education, where transcription can be used, for example, to build systems for automated instrument tutoring. However, it is still an open problem as far as an end user is concerned, and one of the reasons for that is that most of the work on automatic music transcription comes from the fields of signal processing and machine learning, and input is needed from other disciplines, such as music acoustics, music perception and cognition, and musicology, in order to come up with a fully functioning system. One open question is the input representation: what is the input of your system? There is quite a lot of work in psychoacoustics on representations derived from computational models of the human auditory system that could be investigated. Music language models: I did present some work on that, but the field is far from fully addressed. Then there is the question of how to track multiple concurrent acoustic events over time, and also how to adapt these systems to different conditions. Let's say that you have a transcription system that works well enough for studio recordings; if you make a recording using your iPhone, is that still good enough? And there is also the fact that there is quite a lack of dissemination in terms of code and data in the community, and we need to do something about that. These challenges, and many more, are also included in the EU roadmap for music information research that was published a couple of years ago.
Speaker 3: So you were talking about how the field of transcription doesn't really have any unified theory of what the best method is to use, or where the best direction is to go. Do you think that these matrix factorization methods are the way to go? Are there drawbacks to them? Are there ways in which other approaches are better?
Speaker 2: Well, I'd say that maybe for now the majority of methods being used are based on these matrix factorization methods. There are some drawbacks, in the sense that in order to incorporate the full extent of prior knowledge that we want to incorporate, you need to come up with really complex rules. So the problem with these methods is that you basically need a piece of paper and a lot of time to come up with a big equation like this to put in all the knowledge you have, and even that doesn't cover everything. Whereas with deep learning methods it's more of a black-box thing: you just hope that it will work if you have enough layers and inputs. In my opinion these methods are promising in the sense that they can be used not only for solving one specific problem but for doing quite a lot of things. For example, this one here can also be used for source separation; it can be used for detecting instruments, it can be used for deep browser detection, and things like that. So I think that transfer learning argument is quite significant in that respect for these types of methods.
Speaker 3: So this one will be better at doing, say, instrument detection than a discriminative model that was built just for instrument detection?
Speaker 2: So, discriminative models are not particularly successful for detecting instruments in polyphonic music, because they use this sort of bag-of-features approach: you compute various types of features, you put them into some sort of classifier, and you hope that it will be good. I think in order to do proper instrument assignment, as we call it, so to detect every note and assign each detected note to a specific instrument, you need to go beyond purely discriminative methods and come up with some sort of joint source separation method that tries to separate the instruments and then identify the instruments.
Speaker 5: I saw that some of the summations sum over the number of events or the number of instruments. How would you determine that?
Speaker 2: Yeah, so in this model the number of instruments is predetermined. We don't know how many instruments exist in the recording, but we do have a dictionary; I think in this case the model had 20 different instruments which could be present in the recording, like piano, violin, viola, cello, and so on. The system is able to detect which instruments exist in the recording, but we do have a predefined set of trained instruments beforehand. We also have a predefined set of possible notes: the model covers a set of 88 possible pitches on a semitone scale, from A0 to C8.
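As a small aside, the fixed pitch range mentioned here is easy to write down explicitly; the snippet below simply enumerates the 88 pitches from A0 (MIDI note 21) to C8 (MIDI note 108) and their frequencies under 12-tone equal temperament referenced to A4 = 440 Hz, the same reference that the model's tuning distribution works against.

```python
# The 88-pitch range A0..C8 and its equal-temperament frequencies (A4 = 440 Hz).
midi_range = range(21, 109)                               # A0 .. C8 inclusive
freqs = {m: 440.0 * 2 ** ((m - 69) / 12) for m in midi_range}
print(len(freqs))                                         # 88
print(round(freqs[21], 2), round(freqs[108], 2))          # 27.5  4186.01
```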
Speaker 5: In terms of generalization, because you are using templates, can you generalize? If you have some templates from a certain piano, can you use a signal from another piano, or another channel?
Speaker 2: So, one of the interesting things about this model is this probability here, which tells you the probability of an instrument being present given a note at a given time frame. That essentially means that if we have a new note coming from an unknown instrument, we can approximate it as a sort of linear combination of notes from our dictionary. So we can approximate a new piano by combining the different pianos we have in our dictionary.
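A rough sketch of that idea, for illustration only: fitting an observed note spectrum from an unseen instrument as a non-negative combination of same-pitch templates from a dictionary. The dictionary and observation below are random stand-in data, and non-negative least squares is just one simple way to estimate the combination weights; it is not the EM procedure of the actual model.

```python
# Approximate a note from an unseen instrument as a non-negative combination of
# same-pitch templates from the dictionary (stand-in data, illustration only).
import numpy as np
from scipy.optimize import nnls

num_bins, num_dictionary_instruments = 512, 20
rng = np.random.default_rng(2)

dictionary_c4 = rng.random((num_bins, num_dictionary_instruments))  # one column per instrument
observed_c4 = rng.random(num_bins)                                   # "new piano" spectrum

weights, residual = nnls(dictionary_c4, observed_c4)   # non-negative combination weights
approximation = dictionary_c4 @ weights
print("per-instrument weights:", np.round(weights, 3))
print("residual norm:", round(residual, 3))
```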
Speaker 4: So in the transcription you are doing, you are not using any sort of overall structural features, like a time signature or a running metrical frame. Do you think that would help you, in particular with the problem of notes dropping out? Because one of the things that we know about human musical understanding is that events that are very predictable in terms of the metrical frame tend to be reduced, and I anticipate that might account for some of your difficulties with notes dropping out. Do you think there is anything to that?
Speaker 2: Absolutely. There is one difficulty in coming up with a model that can do that, but there is also the problem that in real-life audio recordings the recordings are not that clean. So whereas the intention of a performer might be to follow an exact time signature, the resulting audio recording may not follow that time signature very rigidly, or there is also reverberation and echo mixing notes together, that sort of thing, which smears things out.
Speaker 3: One of the cool things, it seemed, about the NMF method is that, as you're doing the factorization, you can almost come up with your own dictionary. Is that something you think could be done almost in real time, as you're listening to a recording, without the pre-trained dictionary, and are there more difficulties there?
Speaker 2: There are quite a few systems that try to do that, but there is a difficulty in the sense that these models are too rich, essentially, and you need to come up with quite a lot of constraints in order to make them work. So, in order to learn the dictionary just from the recording itself, you need to specify, for example, that each element in your dictionary needs to be harmonic, or that the envelope of a clarinet is not really smooth, or that the templates for a piano are not exactly harmonic, because the piano is inharmonic. So you need to incorporate all these various rules, and then it gets complicated.
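One common, simple way of imposing the harmonicity constraint mentioned in this answer is to restrict each dictionary template to a fixed harmonic comb of frequency bins while running the usual NMF updates. The sketch below is a generic illustration under that assumption, not the constraint set used by any particular published system; the parameter values are arbitrary.

```python
# Sketch: learning a dictionary from the recording itself while constraining
# each template to a harmonic comb of bins (generic illustration only).
import numpy as np

def harmonic_mask(f0, sr=44100, n_fft=2048, n_harmonics=10, width=2):
    # Binary mask over STFT bins around the first few harmonics of f0.
    mask = np.zeros(n_fft // 2 + 1)
    for h in range(1, n_harmonics + 1):
        centre = int(round(h * f0 * n_fft / sr))
        if centre >= mask.size:
            break
        mask[max(0, centre - width):min(mask.size, centre + width + 1)] = 1.0
    return mask

def harmonically_constrained_nmf(V, f0s, n_iter=100):
    # V: magnitude spectrogram (bins x frames); f0s: candidate pitches in Hz,
    # e.g. [440.0 * 2 ** ((m - 69) / 12) for m in range(36, 96)].
    F, T = V.shape
    masks = np.stack([harmonic_mask(f0)[:F] for f0 in f0s], axis=1)   # (F, K)
    rng = np.random.default_rng(0)
    W = (rng.random((F, len(f0s))) + 1e-3) * masks                    # comb-shaped templates
    H = rng.random((len(f0s), T)) + 1e-3
    for _ in range(n_iter):
        WH = W @ H + 1e-12
        W *= ((V / WH) @ H.T) / (H.sum(axis=1, keepdims=True).T + 1e-12)
        W *= masks                                                    # enforce harmonic support
        WH = W @ H + 1e-12
        H *= (W.T @ (V / WH)) / (W.sum(axis=0, keepdims=True).T + 1e-12)
    return W, H
```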
Speaker 4: Is there room for using spatialization of instruments in stereo recordings to further isolate the different instruments playing?
Speaker 2: Absolutely. This is completely unexplored at the moment; I'm not aware of a single system that exploits stereo or spatial information, so there's quite a lot of room for research in that. Thanks very much.