Speaker 1: Good evening, and thank you for the invitation to speak here tonight. It's certainly a great honor for me to speak to you at this distinguished lecture. I don't know how distinguished I am, but I certainly look forward to the opportunity to tell you a bit about the work that I've been doing over the last couple of decades. Obviously I'm not particularly gifted, because I'm still working on music transcription after 25 years, so I haven't solved it yet, but maybe one of you can solve it for us, and then I can retire. Okay, so having done that, having described the work on transcription, I'd like to go on and look at a much harder problem, that is music understanding, and think about how transcription relates to how we, or how a machine, can understand music, whatever that means. And I don't want to get into any philosophical debates about artificial intelligence and whether machines can really be intelligent or whether they can really understand music. I'm much more of a pragmatic person who just thinks: if we can build useful tools, whether they are or aren't intelligent doesn't matter. As long as they appear to be, as long as they do what we want them to do, we can be very happy with that.

Okay, so let me look at what I'm doing with... where's my cursor? Oh yes, I'm over there, good. So, as an outline, I'm going to speak briefly about my background, just how I came to get into this field, because when I was young it didn't really exist, and there were very few people in the world doing what we now call music information retrieval or music informatics. So I kind of stumbled my way into it. Then I'm going to talk about a few different methods that I've used for music transcription, and then the applications, as I said, to music understanding. And one particular application I want to show you is some work on analysing melodic patterns in jazz improvisation.

Okay, so my background. As Ichiro mentioned, I have a PhD in, well, he didn't mention this bit, I have a PhD in artificial intelligence, in logic, belief revision, if that means anything, non-monotonic logic. It was kind of big when I did it, but nobody cares about it now. It's been replaced by neural networks. And I was also a professor at the University of New York. I was always interested in music; I've played guitar most of my life. And I got the opportunity in 1999 to combine these two halves of my life, the music side and the technical computer science, artificial intelligence side, when I saw a job advertised in Vienna for a postdoc in AI and music. It hadn't really clicked to me that these things belong together, that I should be combining both sides of my life, because there wasn't much academic work in music technology. The music technology work that existed was a different type of thing: it was about how to use technology in composition or performance, but it wasn't about the type of analysis and analytical work, like automatic transcription, that I've ended up working on. So I managed to combine these interests, and I moved to Vienna and worked for seven years, and
basically managed to establish myself as a researcher in this area, and then I got a faculty job in London at the Centre for Digital Music, where I've been now for just about 16 years. I'm now the deputy director of the Centre for Digital Music, which has, I think, about 100 people working in the general music technology area. And most of those people, more than half of them, certainly, are the PhD students in our doctoral training centre, which is on AI and music. If anyone's looking to do a PhD, we will be taking applications for next year; we'll be advertising in a couple of weeks' time, so look out for that. And if anyone's looking for a faculty job, we need more faculty in our area, so please come and speak to me. There are lots of things we'd like to do, and not enough people to do them.

So, why work with music? Here I am standing in the hall of a music department; it doesn't really make sense to ask this question. I don't need to justify my existence as a music-something researcher. But when I write a grant proposal, it's always good practice, isn't it, to think about this. Why work with music? We have to convince the rest of the world, even if we don't have to convince ourselves. And maybe some people here stumbled in accidentally and think, why does anyone work with music, and what are they doing with music technology? So I thought about different reasons why music is important, and perhaps I might be preaching to the converted here, but you can't put in a grant proposal that, well, I really love playing guitar, so give me six million pounds for a research centre on AI and music. It doesn't really win the money.

So, firstly, music is immensely important in our culture, in all cultures, in fact. It's universal. Every culture has some form of music. Now, certainly, those forms of music are not easily comparable; they're extremely different. But music certainly exists across the world as something that everyone relates to and enjoys, and it gets used in many different ways in our society. I've listed some of them here. People have done these types of studies where they ask, what do you use music for? What's its function? And the number one answer is emotional regulation, which probably means it makes me feel good. Or maybe, I want to feel bad, I want to feel angry, or I want something that wakes me up and gives me some energy to get going in the morning. So that's one of its functions. But it has another really important function, I think, especially in these times of very divisive politics, with a lot of people trying to get parts of society to see other parts of society as their enemies. Music does the opposite: it brings people together. And historically, it's been a really powerful force. You think about the USA, where black and white people were playing jazz together long before they could sit together on a bus or go to the same universities. So it was a powerful force and it brought people together, and I think it still does the same thing today.
Of course, we use music. I mean, who would go to a party with no music? It would be a very rare thing. We use it in our recreation. It's used for very high and lofty things such as worship, but it's also used in advertising. It's used to get your attention, to create something memorable which gets stuck in your brain, whether you like it or not. Sometimes it's good, sometimes it's bad. So music has many functions, and I think that's an important thing to understand.

And then, of course, when you're talking to funding agencies, it's probably more useful just to skip to the next point and say there's lots of money in music. Clearly, there's always been a large music industry and lots of money does go into music. It doesn't often go to the musicians, unfortunately, but somebody makes money out of it. And these days, what's changed during my career is that in music technology, the type of thing that I do, there's now a lot of industry involvement. When I started, I remember meeting Roger Dannenberg in the 1990s, and everyone who came up to Roger asked, so how do you get money to do this type of work? Where do you get funding from? Because there was no industry in the area. Industry was fighting against technology in the 90s. They were doing everything they could to stop any technology from advancing in the music industry, because all they saw was the threat that people could copy digital music. Ooh, dangerous thing. People will copy all the files and then they won't buy music anymore. So people were asking, well, how do you get funding to do research? Because it didn't exist. Nowadays, it's kind of obvious. You just say, what type of work do you do? Music technology? And it's, oh yeah, like Apple, Spotify, and you can list 100 names like that, lots of companies that didn't exist 10 years ago that are now worth more than you can imagine. So obviously there's a big industry there.

But maybe the more important thing for me as an academic is the academic side to it: the opportunity and the challenge. On one side, there's the challenge of how we model human intelligence and creativity, which is a really important part of intelligence. We're good at modeling how people play chess, and even language, now, we're starting to solve, but creativity is still a fair way off, and music is a really good field in which to study creativity. From a more down-to-earth, engineering point of view, there's the complexity of the signals: music is multidimensional. It's not like speech, where it's just one thing happening at a time. It's multiple different sources of sound playing together, and that makes it very complex, and a technical challenge, at least, to analyze. But it's also an opportunity, because we're living in the days of big data and big computers, lots of them. We can destroy the planet by using lots and lots of energy on large artificial intelligence models, right? So we have a problem there: we have to balance what we can do with what we should do.
But there's an opportunity there to apply the technology, which is now incredibly powerful, to learn and understand things about important cultural artefacts such as music. And so that's what we're looking at.

OK, so that brings me to automatic music transcription, which I guess in the music department... you all know what music transcription is, as Ichiro said. Dictation, you might call it. You hear some music and you write it down in some type of notation, something like, down the bottom here, a musical score. When we do it on a computer, we do it slightly differently. We would generally compute something like this time-frequency representation. This is a spectrogram, as an intermediate form, and in it we want to detect notes. What we see here on the horizontal axis is time, going from left to right, just like in music notation. From bottom to top is frequency, kind of like musical notation, but here we're talking about frequencies. So even if you play a single note, you don't just have one horizontal line here, you have another one here, and these represent the partials in the harmonic series of the note. If it's a pitched note, it will have energy not just at one frequency, but at a whole lot of frequencies. So usually the first step is to condense those partials down to individual notes, and then we have what's called a piano roll representation. We want to detect where the horizontal-looking lines are, and where there's a whole set of parallel ones that start and end at the same times but are spaced evenly in frequency, which would be a harmonic series, and they all get condensed down to single notes. That gives us a piano roll representation.

And if you're an MIR researcher you might stop at that point and say, well, that's solved: I found out where the notes are, what pitch they have, when they start, when they end. That's all we need to know; we know what the musicians played. But of course most of us are not very good at reading piano roll representations, and if you were given that to play, you might find it a bit difficult to sit down at your instrument and play it. So how do we go from there to the musical score? There's a huge amount of work involved in interpretation, because we have to work out, firstly, what key are we in? What's the time signature? What's the tempo? Once we know those things, we can start naming the pitches that we've recognized, because you have to know, is it a C-sharp or a D-flat? You have to know, is this note on the beat or off the beat, is it on the first beat of the bar or on the second? So we have to convert time in seconds to time in beats and measures and so on, and frequency in hertz, or once it's been converted, MIDI pitch, has to be represented in musical units, as sharps and flats and those types of things, and not just as MIDI note 61. Okay, so that's what the task is.
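As a rough illustration of that last step, converting frequency in hertz to MIDI pitch, here is a minimal sketch in plain Python. The hertz-to-MIDI mapping is standard; the note naming here is just a fixed sharp-based spelling, so it sidesteps exactly the key-dependent choice between C sharp and D flat that the interpretation step has to make.

```python
import math

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def hz_to_midi(freq_hz, a4_hz=440.0):
    """Convert a frequency in hertz to a (possibly fractional) MIDI note number."""
    return 69 + 12 * math.log2(freq_hz / a4_hz)

def midi_to_name(midi_note):
    """Name a MIDI note with a fixed sharp spelling; key-aware spelling is the hard part."""
    n = int(round(midi_note))
    return f"{NOTE_NAMES[n % 12]}{n // 12 - 1}"

# Example: 277.18 Hz comes out as MIDI note 61, which a score would spell as
# C#4 or Db4 depending on the key we are in.
print(round(hz_to_midi(277.18), 2))       # ~61.0
print(midi_to_name(hz_to_midi(277.18)))   # 'C#4'
```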
Why is it difficult? Well, I like to make the comparison with speech recognition, because speech recognition is something which engineers have pretty much solved. They've been working on it for many decades; huge amounts of money have gone into it, and huge numbers of publications on the topic. But speech recognition is kind of easy. One person speaks at a time. Sometimes there's some background noise, maybe even a cocktail party, but still you're focusing on one thing. In music there are multiple simultaneous sources, and they're all important, because they're all part of the music. So generally we want to know not just what one person is saying at the cocktail party but what everyone's saying at the same time. But unlike cocktail parties, where people tend to start and finish words at different times to each other, in music the musicians rehearse, so they all start and stop notes at the same time together. There's none of this thing where one person starts a word, somebody else starts a word, somebody else starts a word, all at random times with no correlation, which would make them easier to separate. In music they all go together. They play a chord. And it's like, well, what's the difference? How many notes was that? It sounded like one; they all blended together. Well, that's nice aesthetically, but it's actually technically a pain, because now it's hard to separate them again. So all of the things that work to make music sound pleasant to us also work to make it difficult to pull apart and to recognise the constituent elements.

So we have this correlation in time, and we also have a correlation in frequency, which I've drawn here on the diagram on the right. There's a reasonably standard voicing of a C major chord, at least the one that I would play on a guitar if I was playing a barre chord; it would be exactly those notes. And I've marked the frequencies of all of the partials in the harmonic series, up to however many frequencies fitted on the axis, up to 1,600 hertz, and what you see is that a lot of these frequencies overlap. They line up with each other. So when I play one C and I play another C an octave higher, in fact all of the partials of the higher C align with the partials of the lower one; they overlap. So there are no new frequency components whether I play two Cs together or just the bottom one. The frequency components which appear in the spectrum are exactly the same; it's just that perhaps the even ones are a bit louder because you've played an extra note. And it's quite hard to separate these things unless there's some extra information we can use. So music has these difficulties which make it an extra challenge to separate. But that doesn't discourage us, because we all love music, right?
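A tiny numerical illustration of that overlap, assuming ideal harmonic partials (real strings are slightly inharmonic, but the point stands): every partial of the upper C sits on an even partial of the lower C, so the octave adds no new frequency components.

```python
# Ideal harmonic partials of C3 (~130.81 Hz) and C4 (~261.63 Hz), one octave apart.
c3, c4 = 130.81, 261.63
partials_c3 = [c3 * k for k in range(1, 13)]   # first 12 partials of the lower C
partials_c4 = [c4 * k for k in range(1, 7)]    # first 6 partials of the upper C

# Each partial of the upper note coincides (within rounding) with one of the lower note,
# so adding the octave only changes the relative strengths of existing components.
matched = [round(f, 1) for f in partials_c4
           if any(abs(f - g) < 1.0 for g in partials_c3)]
print(matched)   # all six partials of C4 are matched by partials of C3
```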
So I'm going to talk about just a few different systems which I've worked on over the years. The first of these was with my first PhD student, Matthias Mauch, who worked on chord transcription. His goal in his PhD, which he succeeded at, was to create a system which could work out what chords were being played in a piece of pop music and basically output them, in a similar way to the way a musician would write them out. "Intelligent" was the word we used, but what does intelligent mean? Well, it means kind of like a musician: it would do what we want it to do. So what I've shown here at the top is just what you would buy if you went to your local music store and bought the book of some Queen songs. I think that's a line from a Queen song. This is the output from our system, the important part being this sequence of chord symbols. And the G, D over F sharp, E minor is exactly this line here. So it got that bit right. The notes underneath don't actually mean anything; they were just automatically generated from the chord symbols. So our aim was to create this automatic transcription.

And automatic transcription of chords is, on one side, an easier problem than the note transcription I was talking about, because you don't need to know every note that's played. You can work out the chords without actually working out every single note that's being played. But there's a much more difficult element to it, and that is the interpretative element. You have to work out which notes that have been played have a harmonic function, and which notes are just passing notes, or auxiliary notes, or whatever else you want to call them. Non-harmonic notes, right? Notes that don't belong to the harmony. How do you tell the difference? Well, if you know what chord it is, it's really easy. Because if I say this is a C major chord and I play the notes C, D, E, F, G, then you know the D and the F are passing notes. But if I said this is a D minor 7 chord and I play the same notes, C, D, E, F, G, then suddenly it's the E and the G which are passing notes, and not the D and the F. Right? So which comes first? It's kind of a chicken and egg problem; there's no easy way. Now of course you can say, well, of course it's the ones on the stronger metrical positions. But it's not that simple: the notes on the stronger metrical positions are only more likely to be the chord notes. And yes, you can use those types of things to work it out. But there is this step of interpretation, which isn't required when you're doing just note transcription, in order to work out what the chords are.

So the method we used was a statistical modelling approach called dynamic Bayesian networks. This is a type of hidden Markov model; I'll explain that a bit more on the next slide. We try to integrate the musical context into the estimation of the chord. The idea being that it's not just, take a very small slice of audio, see which notes are played, and then label that with a chord symbol. There's a lot more that goes into it, because it depends what key you're in and what the previous chord was and all those types of things; that will influence the way that it should be interpreted. So what we did was we calculated something like the time-frequency representation I showed on a previous slide, the spectrogram. In this case it was a log-frequency spectrogram, because musical pitch is logarithmic in frequency. We tuned it, because not everyone plays with A4 at 440 hertz when they record music, and even if they do, maybe when it gets transferred from the tape to the CD it gets transferred at the wrong rate or something and it gets a bit sharp or a bit flat. So we tuned it so that our frequency grid lined up with semitones. And we also applied a beat tracker. Ichiro mentioned before that I did this work on beat tracking. So we used a beat tracker to work out where the beats are in the music. And then for each beat we calculated just one chord, because chords don't change more than once per beat, usually, in pop music. Usually they stay the same for a few beats at least, if not many. So this is called beat-synchronous features: the features we calculated were for each beat in the music. And we had this approximate transcription method, which was actually very simple. It was a built-in function in MATLAB, the language we were using, that gave us an estimate of which notes were present, basically that mapping from the frequency representation to the pitch representation on the previous slide.
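A minimal sketch of the kind of beat-synchronous pitch-class features just described, assuming librosa; the original system used its own tuned log-frequency front end in MATLAB and a separate bass/treble split, so treat this as an approximation of the idea rather than the actual pipeline. The file path is illustrative.

```python
import numpy as np
import librosa

# Load audio (path is illustrative).
y, sr = librosa.load("some_pop_song.wav")

# Estimate beat positions, then compute a chroma (pitch-class) representation.
tempo, beat_frames = librosa.beat.beat_track(y=y, sr=sr)
chroma = librosa.feature.chroma_cqt(y=y, sr=sr)   # log-frequency, folded into 12 pitch classes

# Aggregate the frame-level chroma over each beat: one 12-dimensional vector per beat,
# which is the unit the chord model labels with one chord symbol per beat.
beat_chroma = librosa.util.sync(chroma, beat_frames, aggregate=np.median)
print(beat_chroma.shape)   # (12, number_of_beat_segments)
```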
So can I explain to you what a dynamic Bayesian network is? Well, probably not in the time we've got. But it's a type of hidden Markov model, which is a type of graphical model, where basically you have these variables, which are drawn as circles, and you have arrows between them, which represent dependencies between the variables. So if one of the variables has links coming to it from other variables, that says it's dependent on those other variables: whatever values the other variables have will influence the current one. And in this work we have four variables that we estimate. One is metrical position, one is key, one is chord, and one is the bass note. The other two variables here, the bass chroma and the treble chroma, are things which are derived from the audio signal. So this is the transcription step: basically the low-pitched notes go into the bass chroma, and chroma just means we throw away which octave it is. We just want a pitch class and don't worry about the octave. So we have our observations for the model, this bass chroma and treble chroma, and the variables we want to infer are those four.

Metrical position is very simple. It's a number from one to four, because music goes one, two, three, four, one, two, three, four. Remember we're counting in beats; each observation is a beat. And we're dealing with pop music: it always goes to four. It never goes to three or six or... okay, it's a simplification. It's a probabilistic model, so we have some very small probability that it doesn't do that, that it does something else. But basically the assumption is nearly always right: it goes one, two, three, four. So we made lots of simplifying assumptions, but it was an early attempt at this. The key: we try to infer what key we're in. And which chords are played, and the bass notes, we get from our observations.

So if we look up here at the metrical position... these two columns, I should say, are time steps. This is the previous time step, what values these six variables had, and this is the current time step. At the current time step the metrical position depends on the previous metrical position. So if we were on the second beat at the previous step, we are now on the third beat of the bar, very simple, and there's a very small probability that it could jump back to the first beat and start a new bar, because it was a bar of 2/4 or something. As another example, take the chord, which is probably the most interesting one. It's dependent on three things. It's dependent on the previous chord. Why? Because most of the time the current chord will be the same as the previous one, because most chords last for more than one beat, right, very simple. It's also dependent on the key, because if the chord does change, it's likely to change to one which belongs to the present key: a diatonic chord is more likely than a chromatic one. And finally, the metrical position. Now how on earth can a chord be dependent on the metrical position? You say, well, you know, do D minor 7 chords happen on the second beat of the bar? No, that's not the reason. The reason is that the chord is very likely to be the same as the previous chord, unless it's the first beat of the bar, where that's slightly less likely, because on the first beat of the bar it's more likely that the chord changes.
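A minimal sketch of that kind of chord-transition probability, with made-up numbers; the real dynamic Bayesian network learns or hand-sets these values and also conditions on the bass note, so this only illustrates the intuition that chord changes are more likely on the first beat and favour diatonic chords.

```python
def chord_transition_prob(prev_chord, chord, key_diatonic_chords, beat_in_bar):
    """P(chord_t | chord_{t-1}, key, metrical position) -- illustrative numbers only."""
    # Chords usually stay the same from beat to beat; a change is more likely on beat 1.
    p_change = 0.4 if beat_in_bar == 1 else 0.1
    if chord == prev_chord:
        return 1.0 - p_change
    # If the chord does change, diatonic chords of the current key are favoured.
    if chord in key_diatonic_chords:
        return p_change * 0.9 / len(key_diatonic_chords)
    return p_change * 0.1 / 17   # crude split of the remaining mass over chromatic chords

c_major = {"C", "Dm", "Em", "F", "G", "Am", "Bdim"}
print(chord_transition_prob("C", "C", c_major, beat_in_bar=3))    # staying put: high
print(chord_transition_prob("C", "F", c_major, beat_in_bar=1))    # diatonic change on the downbeat
print(chord_transition_prob("C", "F#", c_major, beat_in_bar=3))   # chromatic change mid-bar: tiny
```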
So it's the probability of the chord changing that is influenced by the metrical position, not the actual chord itself, if that makes sense. That type of intuition is built into this network, and it allows the network to magically do whatever Bayesian networks do. It does its reasoning and comes up with the most likely sequence of chords that explains the data that we've observed. That's the way these models work: they're generative models, so this is a kind of model that generates music. But we're not actually generating music, we're analysing it, so we run the model backwards. We work out which sequence of chords would be most likely to have generated the music that we had on our recording, and that's the way the model works. Okay, so that was our attempt at transcribing chords. And at its time, it was the best. We put it into the public evaluation, the MIREX campaign, and it came out best for a year or two, until somebody else took over, of course. My student graduated, and somebody else was the best. But we enjoyed it while we could.

The second example I'd like to give of a transcription system is work that another PhD student did, Emmanouil Benetos. And this was based on the idea of non-negative matrix factorisation, which is basically the idea that you can take the spectrogram, that time-frequency representation, and factorise it into two different matrices. One is a kind of dictionary of all the possible sounds which occur in this piece of music. When you think of piano music, this makes a lot of sense, right, because there are only 88 keys you can press, and so there are 88 sounds you can make. So one matrix is this dictionary of the 88 sounds you can make, and the other one is the activations, that is, when those sounds were made and, when they were made, how loud they were, basically. So they're kind of the amplitude and time of the activations of what we call the templates, this dictionary of sounds. You can think of the templates as being your synthesiser, and the activations as being your sequencer, telling the synthesiser when to produce which note. And that can be done for transcribing a single instrument or for many different instruments, because your templates can come from more than a single instrument: we can have many templates for one pitch for the different instruments, or even for different playing techniques on the same instrument, and so on. The only downside to this is that as you add more and more different things that the system is allowed to use to describe the signal, what invariably happens is that although one instrument played one note, it will try to explain that one note as a combination of five different instruments playing that note, all with different amplitudes, to get something closer to the timbre of that particular instance, because generally the timbres don't match, because, well, not every saxophone sounds the same as every other saxophone. Anyway, the model is that we're trying to do this factorisation into templates and activations.
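A minimal sketch of plain non-negative matrix factorisation of a spectrogram into templates and activations, assuming librosa and scikit-learn; the system described here actually used a shift-invariant, probabilistic extension of this idea, as explained next, and the file path and thresholds are illustrative.

```python
import numpy as np
import librosa
from sklearn.decomposition import NMF

# Magnitude spectrogram of a piano recording (path is illustrative).
y, sr = librosa.load("piano_piece.wav")
S = np.abs(librosa.stft(y))                 # shape: (frequency_bins, time_frames)

# Factorise S ~= W @ H with, say, 88 components: each column of W is a spectral
# template (one "sound", e.g. one piano key), and H holds its activations over time.
model = NMF(n_components=88, init="random", max_iter=400, random_state=0)
W = model.fit_transform(S)                  # templates:   (frequency_bins, 88)
H = model.components_                       # activations: (88, time_frames)

# Thresholding the activations gives a rough piano-roll-like picture of which
# components (ideally, which notes) are active in each frame.
piano_roll = H > 0.1 * H.max()
print(W.shape, H.shape, piano_roll.shape)
```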
What we did in our work was to extend this model, firstly by making it shift-invariant, which means that we reuse templates across different pitches. Apart from the pitch, of course, a C and a C sharp on the same instrument have a very similar timbre, so we don't need to have unique templates for every single pitch that can be played. We can just take a generic template and shift it in pitch, and that allows us not just to shift it by a semitone to go from a C to a C sharp, but also, if we play the C with vibrato, or we do a pitch bend, or whatever, to represent things between pitches. That was something that hadn't been done before in this type of work at the time. So we were looking at being able to transcribe music with vibrato and pitch bend, and this worked okay, and I can play an example of it, but I can't play it by clicking this, because my computer doesn't allow that anymore. It says it's a security risk if I play files straight from my presentation, so I'll play it from here. So that's lyra music, if you've never heard it; the lyra is a Greek instrument that looks a bit like a violin, but they play it like this, kind of like a cello. And here's our transcription of the same music. Hopefully it sounds like the same piece of music at least; I'm sure it didn't get every note correct. It sounds pretty awful, because we're just synthesising it on a MIDI synthesiser, but it captures the music; it worked reasonably well. And so that was one of the better techniques for transcription at the time.

The next part of transcription I'd like to mention is some work we did, again with Matthias Mauch, after he'd finished his PhD and came back and worked for a couple of years with us. And this is work on a simplification of transcription to what should be really easy, that is, monophonic transcription, one note at a time. This should be solved, right? It's so easy. Commercial products do it. Yeah, they do it pretty badly. So, YIN was kind of the state of the art, coming from the speech world, in the 2000s. We used it sometimes for the work we were doing on singing, and it was okay, but it wasn't quite there, so we looked at how we could improve it. And basically, there were two problems with YIN. One was that it relied on the choice of a threshold, and what works for one instrument wouldn't work for another one, or what works for one recording wouldn't work for another. So you just couldn't get reliable results out of it without a lot of hand-holding, which is no good when you're doing the big-data type of work we were interested in. The other problem was that it required post-processing. The output it produced was reasonable: most of the time it got the correct pitch, but there were a lot of incorrect extra notes at transitions, where notes begin and end, where there'd be just one or two frames an octave higher or an octave lower than they should be, that type of thing. So we looked at how we could improve that, and we developed, again, a probabilistic model where, rather than just having a single threshold value, we said, well, we don't know what the correct threshold value is, but we know it's in this range, and we basically defined a probability distribution for it.
And then we computed, based on that, the most likely sequence of pitches, given this prior distribution of threshold values. Basically we're calculating, over all possible threshold values in this range, what would be the most likely sequence to have produced the output we're observing. And it turned out that this gave us some really nice overall results. It was more robust and less sensitive to thresholds. So I'd like to give you a demo of it, if I can. Let's see. Just to show you what it looks like. So we produced this system called Tony, for analysing intonation. And I'm doing a little bit of a live demo here. This is really dangerous. This could fail. All right, so we load in a file of some random person singing. I won't tell you who it is, but it might have been one of the authors of the paper, and it's not me. Da-da-da-da-da-da-da. So in the time it took me to load that up, it's extracted the pitch track, shown in black, and it's put blue boxes around all of the notes. And you can just look at them, and you see this note was sung with a frequency of 136.968 hertz. That's its median, because as you can see, when people sing, the pitch never stays fixed for any length of time. There's no such thing as beautiful horizontal lines as in piano music.

Okay, so that's a really nice tool. And if it makes a mistake, such as merging two notes together, you can just say, I knew that wasn't one note, that was two notes: split it into two notes. So you can split it by, oh, no, sorry, select edit first, and you split it into two notes somewhere in the middle, and then each of those notes has its own median frequency and so on. Or if it's the other way around, you can merge them together. And if you want to extend them, if you say, oh, this one wasn't long enough, you can extend it out. Or you can say, no, that's too long, that bit at the end wasn't part of the note, and so on. So it's very easy to edit and fix things if you want to do an analysis of the way people sing. We did a couple of papers on how people sing out of tune, which was all lots of fun. So that's a system called Tony. I don't think I want to save it.
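A minimal sketch of probabilistic monophonic pitch tracking in the spirit of the Tony work just described, using librosa's pyin function, which implements the pYIN algorithm; the note segmentation and editing that Tony adds on top are not shown, and the file path and frequency range are illustrative.

```python
import numpy as np
import librosa

# A recording of unaccompanied singing (path is illustrative).
y, sr = librosa.load("singing.wav")

# pYIN keeps a distribution over YIN threshold values rather than one hand-tuned
# threshold, and decodes the most likely pitch track through a probabilistic model.
f0, voiced_flag, voiced_prob = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C6"), sr=sr
)

# Median frequency of the voiced frames, similar to the per-note medians Tony reports.
print(np.nanmedian(np.where(voiced_flag, f0, np.nan)))
```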
So now we move on to the modern days, where suddenly everything we used to do gets thrown away, and now we just do everything with neural networks, right? So I had a student, I think he joined in 2012 or 13, who was very eager to work with neural networks. And he was very brave in those days, because none of the software worked. It was awful. You had to write everything yourself. Nowadays, you just use Python, you just use PyTorch or TensorFlow, these frameworks that do everything for you. By the end of his PhD he had got some really good results, and I said, we should see if we can commercialize this or do something with it. And he said, no, no, no, throw it away, you could do this in a few weeks now with the new software that's come out. Yeah, oh well. So what we did, as our first foray into working with neural networks, was to follow basically the approach that's taken in speech recognition. And that is to divide the system into two parts, an acoustic model and a language model, where in speech the acoustic model's job is to recognize what phonemes are produced. And then the language model chooses sequences of phonemes that make sense, correcting things so that a likely sequence of words comes out. And this works really well in speech recognition, as anyone who uses Siri, or whatever you use, will know.

So can you use the same approach for music? Well, the acoustic model is going to work out what pitches are played, and then the language model tells you what's likely to happen in music: which combinations of notes occur, which sequences of notes are likely to occur. This was very hard to learn, at least with the models that we were using. It gave us a very small improvement. Out of the results we have here... I mean, we got great results, in the sense that we improved by at least five percent. In fact, these three results are all our work; there were three different models that we used. Compared to the previous work, which included my student Emmanouil Benetos's, we improved by about ten percent, but only about two percent of that improvement was due to the language model. So there's very little help there; most of it was the acoustic model. And in fact, in further work that other people did, building on this, they showed that they could do even better than we could with no language model at all, just by fine-tuning parameters and doing the things that machine learning engineers do.

The other thing that's interesting in these results is the drop in performance when you go from the same piano to a different piano. "Same piano" means that the neural network has been trained on other recordings from exactly the same piano in the same room, same recording conditions. So it knows what that piano sounds like; it knows what a middle C played on that particular piano sounds like, and it can recognise it very easily. And if you play one on a different piano in a different room, suddenly performance is ten percent worse. That's a huge drop. And I've seen it consistently in all types of work that we've done: when you go from training on known instrument sounds to unknown ones, in this case going from synthetic to real acoustic pianos... and piano synthesizers sound pretty good, right? They're not that different, are they? If you're a computer, they are. So there's a big gap there if you want your system to work on any recording from anywhere. On the other hand, if you're working on concerts that happen in this hall on this piano, then you can tune your system to work really well for this piano, and it will. And I fear that a lot of the brilliant results we're seeing in music transcription are in fact more an effect of the fact that they're often learning on just one piano. And of course, yeah, it's getting really good results, but it's not going to work well in the real world.
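A minimal sketch, in PyTorch, of the kind of frame-wise acoustic model described above: a small convolutional network that looks at a few frames of spectrogram context and outputs an independent probability for each of the 88 piano notes. The layer sizes and input dimensions are made up, and the language model, training loop and data are not shown.

```python
import torch
import torch.nn as nn

class FrameAcousticModel(nn.Module):
    """Frame-wise acoustic model: spectrogram patch in, 88 note probabilities out."""
    def __init__(self, n_bins=229, n_notes=88):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((1, 2)),                 # pool along the frequency axis only
        )
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 5 * (n_bins // 2), 256), nn.ReLU(),
            nn.Linear(256, n_notes),
        )

    def forward(self, x):                         # x: (batch, 1, 5 context frames, n_bins)
        return torch.sigmoid(self.fc(self.conv(x)))   # independent probability per piano key

# One batch of 5-frame spectrogram patches with 229 log-frequency bins each.
x = torch.randn(8, 1, 5, 229)
print(FrameAcousticModel()(x).shape)              # torch.Size([8, 88])
```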
So quickly, two things I've had absolutely no involvement in at all, but I wanted to mention them because they're the next steps in music transcription. And they've come from industry, basically, Google. So once Google sends you an email and they say, oh, we just saw your paper, we found it really interesting, you think, oh, great. And then they say, and we're going to reimplement it and do something better. And you go, oh no, they've got more resources, they've got more data, they've got more computers. I can't compete. I've got no chance. So I should have retired at that point. But they didn't offer me a job. Oh, well. So they came up with a system which is not too surprising, in the sense that it's still got this convolutional part, which is doing what we would call the acoustic model, and then the recurrent part of the network, the LSTM part, is a bit like the language model; it's doing the smoothing. But the really clever part about it was that they made it multi-task. They said, we don't just want to learn what notes are being played, we want to know when the onsets of new notes are, and we're going to train this as an onset detector as well as training it as a transcriber. And then the transcriber isn't allowed to have any new notes unless the onset detector says there was a new note at exactly this time. And onset detection works quite well, a lot more reliably than our transcription systems. So by training these two tasks at once, they were able to link these two things together and produce a system which was a lot better than what we had done not many years before.

And then Qiuqiang Kong, who's at ByteDance, although this work was really done at Surrey University when he was a PhD student, took it one step further, I think. He wanted to very precisely estimate not just the rough onset time, but the exact onset time. So even if you're calculating frames of data, you don't just want to find which frame this note started in. He says, no, I don't want to know which frame, I want to know how far between this frame and the previous frame it was, exactly where it started in milliseconds. OK? And he also wanted to measure the pedalling and the velocities. And he put together a system which is more complex, because it's now got four different neural networks all working to calculate their own part of the equation: the onset times, the offset times, which pitches are present in a frame, and the dynamics, or velocity in MIDI terms. And if you remember the results I was getting, 74 percent, five years previous to this: 96.7 percent. But this is on MAESTRO, and this is one piano, so I say I don't trust it; in the real world, it's not as good as that. Google's one: 94.8 percent. So he managed to outdo Google, which is always an achievement. But that's kind of the state of the art in piano transcription.

The last thing I wanted to talk about in terms of transcription is work that a recently graduated student of mine did on lyrics transcription. So this is a totally different thing. Rather than worrying about what notes have been played, what about what words have been sung? You say, well, just use Siri, that works perfectly, right? Try singing into it. Somehow, when people sing, suddenly they pronounce words very differently. They pronounce words much longer than they would otherwise. They also have all these instruments getting in the way, making noises that sound like consonants, right? Those snare drums and cymbals and things. And so any system you try on it just doesn't work. So we developed an approach using what is now considered old-fashioned, this hybrid DNN-HMM architecture, because these days everyone wants to do end-to-end deep learning systems. But Kaldi is a framework which has been around for many years and is still used a lot in industry. It's a real pain to learn, but once you've learned it, it's incredibly powerful. And so we were able to train our model, again an acoustic model and a language model. This is how stuff is done in speech.
And we developed our own network, so we tried some tweaks on the state-of-the-art networks that existed, and found some ways that worked slightly better. The language model is very important. People don't sing the same types of things that they say, right? The shooby-doo-wop doesn't appear in speech. I just said it. I love you, you, you, you, you, you, you, you, you. We just don't do that in speech, but it comes up all the time in singing. And so the system needs to know that this is a different type of English, or whatever language you're going to train on. And so we did some tricks to take the standard pronunciation dictionary that's used for speech and allow things like vowels which are ten times longer than they are in speech, and we trained the system on accompanied and unaccompanied singing, and put in some tricks for it to recognise when there isn't singing, because that's not just silence, it could be accompaniment. And we found... these are error rates, word error rates. We found that on unaccompanied singing we weren't doing much better than what other people had done, but on accompanied music we could get the error rate much, much lower with this system. So that's gone into a product, ScoreCloud Songwriter. I was very impressed with my student for having something that went into a product while he was still a student. It's already on the market. So that's the work on transcription.

So I want to briefly talk about the application of what we do once you've got a transcription. Why is this important for music understanding? And I'll just talk about one project I've worked on. It was a Digging into Data Challenge project, one that involved US and French and German and UK partners, to look at melodic patterns in jazz. And the basic idea was that we can do a lot now with our powerful computer models and access to lots of data, and everyone knows that when musicians improvise, they use patterns. That's what they spend all their time doing, practising patterns, right? They get the Bebop Bible or the Charlie Parker Omnibook, and they just practise and practise and practise until they can play them all fast, in every key, in every position, whatever, and then they go out and perform them. But how do these patterns vary, evolve and spread? Can we analyse that? Can we track them as they spread around the world? Somebody must play a pattern first, and then other people copy it, right? And then how do you measure how influential a musician is? Well, by how often their patterns are copied, just like we do with researchers, right? The H-index: you look at how often this person's work has been cited by other people. So we wanted to look into this question from a very computational perspective. Of course, we were grounded by having some real musicologists in the team we worked with. And so we did this work on first extracting the solo line using transcription technology. Similar to the other ones we looked at, it had a convolutional network and a recurrent part for the language model part. And we expressed the patterns as melodic patterns, basically sequences of pitches. And then we used the edit distance, or Levenshtein distance, to test how similar two patterns are.
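A minimal sketch of that edit-distance comparison, on patterns written as semitone-interval sequences (the same notation used in the search interface described next); the project's actual similarity measure may weight the edit operations differently, so the numbers are only illustrative.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (here, melodic patterns as semitone intervals)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution (or match)
        prev = cur
    return prev[-1]

# Two patterns written as interval sequences: 2 = up two semitones, -3 = down three.
pattern_a = [2, 2, -3, 1, 2]
pattern_b = [2, 1, -3, 1, 2]
print(levenshtein(pattern_a, pattern_b))   # 1: the two patterns differ by one interval
```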
And then, you know, we think this is useful, because if you can understand the patterns, then you can recommend music to people, use it in teaching people about music, and also use it if you want to generate music automatically. So I'd like to show you an example. Let me just get this right. It's much easier to explain what we did by showing you the output of our work. We have this website connected to our database of patterns that we've extracted from solos. And you can play with this; it's open, it's there for the whole world. You type in a pattern, and we're expressing these patterns as patterns of intervals. So two means go up two semitones; minus three means go down three semitones. So you give a list of these numbers, and you search for it, and you get a list of hits. And if you're lucky, these will include the score, but they'll always include the musician, the track it's on, the year, the start time in the track, and so on. But the interesting thing is, of course, that you want to listen to it, because this is music, right? And you know there's a YouTube video with 91 of these, so many instances of the lick. So you can think of any pattern you like and type it in: has anyone played this before? Or you can put in your own patterns and see how creative you really are. Do you play the same things all the time? If you're a jazz player, that's the thing you worry about as you go to sleep: am I being creative, or am I just playing the same things every night? Well, now we can analyse it and work it out.

So that's work that we've done looking into patterns. And the point here is that the music transcription is just a tool to get us to a point where we can extract this information, so that we can then do something interesting musically, to understand what people do when they play music. So we can display this information on a timeline, and you can see all the different people in the database who have played this same pattern, and when they played it. We can draw networks of the people who played it, and look at who's similar to who. So some of these patterns... yeah, that shows a whole lot of different people playing the patterns.

Other work in my lab that uses transcription: I have one student, Huan Sun, who's been going for about a year, and she's looking into expression in classical piano performance. To do that, she's transcribed tons and tons of piano music, using these models that I described today, and she's looking to extract, you know, what are the expressive things that musicians always do, and what are the expressive things that only Glenn Gould does, or only whoever the other famous players are. On the other hand, I have a student working on jazz, Dave Foster. He's not a sax player, he's a bass player, but he decided he was interested in looking at expression in jazz saxophone performance. He recorded 24 hours of saxophone music and transcribed it all, using tools, but he checked everything by hand. He has 2,000 pages of transcriptions, and he did it in maybe six months or something, which is... I think most of us would take 10 years to do that. Amazing guy. This is an example. I mean, maybe there's a mistake in it, but...
it's great. So, this is a database, and it has many possibilities of what you can do with it, because once you've got the audio and the transcriptions and the mapping between them, then you can teach a computer to do this for you next time. And I have another student, Drew, who's been going for about a year, and he's looking at jazz improvisation in solo piano. So, again, he's using those solo piano models, which work so well, and he's collected a huge database of recordings to analyse and to create generative models of jazz performance. They don't sound great yet, but he's still got a few years to go.

So, probably data is the thing which limits what we can do. We have incredibly powerful machine learning models that work well for the tasks they're given, for the things they've been shown lots of data for. So the piano transcription stuff works really well because there's tons of piano data to train it on, whereas for other things, until Dave's data set, we didn't have hours and hours of saxophone music with transcriptions that we could train a transcription model on. But now we do, so we can do it. So, just for fun, I've built a saxophone separator. It separates the accompaniment from the sax playing, and this is what it sounds like. This is just taking a random sax track off YouTube. So, this is with the accompaniment, and I think it's Sonny Stitt. Almost perfect. You'll hear a little bit of the background come through, but basically... So, data gives us the power to do that. I can't do that for trumpet, because I don't have 24 hours of trumpet music separated from the backing.

So, once we have the data, we have the ability to model musical phenomena, and I think that's the exciting position we are in right now, and that's something I'm going to be doing for the next few years of my research career. And so, thank you very much for your attention. I should say thank you to all of my collaborators who contributed to this work which I've described. They did the work; I was kind of, you know, supervising it and nodding at the right times. And, of course, the funding agencies have been very, very helpful. So, thank you very much.