Exploring Music Transcription: Techniques, Challenges, and Applications
Delve into music transcription, from piano roll and MIDI representations to advanced techniques for non-Western music, singing, and drum transcription.
Automatic Music Transcription by Dr. Emmanouil Benetos - Part 4

Speaker 1: So, typically what we get as output of these systems is what we call a piano roll representation, which is a binary representation over pitch and time. Or we might get as output a MIDI-like representation, which is essentially a list of notes, where each row is one detected note: start time, end time, pitch. This is similar to the way that MIDI files are encoded. One of these could be the output of the transcription system and the other could be our ground truth, and we want to compare the two. We can evaluate pitch detection and we can evaluate instrument assignment, so identifying pitches and also identifying instruments. We can also estimate the polyphony level: how many concurrent notes did we have at a given time instance? And we can evaluate those systems either at the frame level, comparing the two representations at each time frame, or at the note level, comparing lists of notes like these. People have proposed accuracy metrics; I won't go into the details, but I'm including equations here in case someone wants to try to evaluate such systems.

Maybe more interesting is the note-based evaluation. How do people evaluate systems that output lists of notes? We say that a note is considered correct if its pitch is within a quarter tone of the ground truth pitch, so plus or minus half a semitone, and its onset is within 50 milliseconds before or after the ground truth onset, with a bit more flexibility on the offset, because durations are ill-defined in general and it is hard even for humans to transcribe and annotate durations properly. These specific values could be deemed too relaxed by musicologists. Take the quarter tone: studies have shown that we humans are able to discern different pitches at a much finer resolution. And 50 milliseconds might be considered completely unacceptable for some applications and some cases, because we are able to discern notes that are even less than 10 milliseconds apart; I think perceptual studies have shown that. With instrument assignment, we are not only trying to evaluate pitches but also trying to assign each pitch to a specific instrument, so this further complicates the problem.

The overall issue we have with all these evaluation methods, however, is that they are purely objective, in the sense that we are just comparing system outputs. We are not really going into the detail of which kinds of errors are acceptable or not acceptable in a transcription. A lot of these systems can make octave errors: we might have a C4, and they might detect both a C4 and a C5. That might be more acceptable than a system which, trying to identify a C4, detects a C-sharp 4 instead, a semitone error, because that sounds much worse to us; it might not follow music theory rules, and it might not follow our own expectations of the music. So there is a lot of work to be done on designing metrics which are musicologically or perceptually relevant.
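As a concrete illustration of the note-based criterion just described (pitch within a quarter tone, onset within 50 milliseconds), here is a minimal, self-contained sketch of a greedy note matcher; the array layout and the toy example are assumptions for illustration only, and in practice one would typically use an established toolkit such as mir_eval rather than rolling one's own.

```python
import numpy as np

def match_notes(ref, est, onset_tol=0.05, pitch_tol_cents=50.0):
    """Greedy note matching for transcription evaluation.

    ref, est: arrays of shape (n_notes, 3) with columns
    [onset_sec, offset_sec, pitch_hz]. Offsets are ignored here,
    mirroring the relaxed treatment of durations described above.
    Returns precision, recall and F-measure.
    """
    ref = np.asarray(ref, dtype=float)
    est = np.asarray(est, dtype=float)
    matched_ref = set()
    hits = 0
    for onset, _, pitch in est:
        for i, (r_onset, _, r_pitch) in enumerate(ref):
            if i in matched_ref:
                continue
            cents = 1200.0 * np.abs(np.log2(pitch / r_pitch))
            if abs(onset - r_onset) <= onset_tol and cents <= pitch_tol_cents:
                matched_ref.add(i)
                hits += 1
                break
    precision = hits / max(len(est), 1)
    recall = hits / max(len(ref), 1)
    f1 = 0.0 if hits == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Toy example: one correct note, one semitone error (C#4 instead of C4).
reference = [[0.00, 0.50, 440.0], [0.50, 1.00, 261.6]]
estimate  = [[0.01, 0.48, 440.0], [0.50, 1.00, 277.2]]
print(match_notes(reference, estimate))  # ~ (0.5, 0.5, 0.5)
```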
Any questions on metrics? Another small thing I wanted to mention about evaluation, especially to students who might want to do their project in transcription at some point, is that there is actually a public evaluation happening every year at the so-called MIREX competition. MIREX stands for Music Information Retrieval Evaluation Exchange, and it is an annual public evaluation campaign, or competition, or challenge as you might call it, taking place every year as part of the ISMIR conference. Every year there are tasks related to transcription: the frame-based task I mentioned, the note-based task, either for many instruments or just for piano, and the stream segregation task, which is a more difficult one. Every year the competition has a deadline around August, with results being published around October. Should you be interested, I would encourage you to participate; I think it's a really nice task, and I have participated myself over some years. Just to give you an overview of the numbers, if they mean anything: we still have some way to go until we reach a performance that might be considered good enough. We're still in the 70s, I would say, so there's still room for improvement here. These are the frame-based results, and these the note-based results. Some people have made some progress on this; however, both of those implementations were made by companies, and therefore we don't really know what happened inside them. So we'd like to see what we can do a bit more in terms of extracting knowledge for improving transcription. Any questions on public evaluation?

Now, a brief mention of how transcription relates to other problems, both in MIR, music information research, and also in musicology. Music source separation is when we have one waveform, one input audio recording, and we want to create multiple waveforms with separate sources. Maybe we have one mixture of many instruments and we want to create a separate recording for each instrument. This could be for the purposes of remastering, or for the reusability of the content. And the problem of multi-pitch detection, of detecting pitches, is really interrelated with source separation. Assuming that we have a recording with piano and violin playing together, mixed together, it would have been much easier to transcribe if we had access to the individual stems of the piano and the violin. And vice versa: it would have been much easier to separate the piano from the violin if we knew the score, if we had the transcription. So these problems are really closely connected, and over the years there has been a lot of research; however, it still remains to be seen whether we can really join the systems together in such a way that they would actually work. For those of you who might be more proficient with machine learning, this relates to the ongoing trend in machine learning nowadays, which is multi-task learning: trying to create systems that can address multiple tasks.

Score following. This is the process where you have access to an audio recording and a reference score, and you want to align the performance, to follow the score as the performance goes along. Again, representations derived from automatic transcription are actively being used in commercial tools for score following, like the Antescofo system developed by IRCAM in France, for instance. Score-informed transcription: what is that? Let's say we want to create systems for automatic instrument tutoring. We have a student who performs a music piece, and they might make some mistakes. Can we use transcription in order to identify those mistakes made by the student? I don't mean necessarily performance mistakes, which would probably be an impossible task at this point; I mean maybe local mistakes, maybe simply mistakes in terms of pitch spelling. There has been quite a lot of research, including here at NUS: Ye Wang has been doing a lot of research on multi-modal violin tutoring via transcription, using both video information from the fingering and the bowing, as well as audio information, and also having access to a score. The general approach, if you want to create a tutoring system, is that you have access to the recording made by the student and you have access to a reference score; you try to align the score with the student's performance, and then you try to identify areas of mismatch where the student might not have performed as expected by the score. That is the general idea, and there are a lot of things to be done in the area of using transcription for music education purposes.
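As a rough sketch of that align-and-compare idea, the following aligns a student recording to a reference rendition using chroma features and dynamic time warping, and flags high-cost regions of the warping path as candidate mismatches; the file names, feature choice, and threshold are illustrative assumptions, not the specific systems mentioned above.

```python
import numpy as np
import librosa

# Hypothetical file names; any reference rendition (e.g. a synthesized score)
# and a student performance of the same piece would do.
ref_audio, sr = librosa.load("reference_rendition.wav", sr=22050)
stu_audio, _ = librosa.load("student_performance.wav", sr=22050)

hop = 512
ref_chroma = librosa.feature.chroma_cqt(y=ref_audio, sr=sr, hop_length=hop)
stu_chroma = librosa.feature.chroma_cqt(y=stu_audio, sr=sr, hop_length=hop)

# Dynamic time warping over the two chroma sequences.
cost, warp_path = librosa.sequence.dtw(X=ref_chroma, Y=stu_chroma, metric="cosine")
warp_path = warp_path[::-1]  # the path is returned end-to-start

# Frames of the warping path whose local cosine distance is unusually high
# are candidate regions where the student deviates from the reference.
local_cost = np.array([
    1.0 - np.dot(ref_chroma[:, i], stu_chroma[:, j]) /
    (np.linalg.norm(ref_chroma[:, i]) * np.linalg.norm(stu_chroma[:, j]) + 1e-9)
    for i, j in warp_path
])
threshold = local_cost.mean() + 2 * local_cost.std()
flagged = [(i * hop / sr, j * hop / sr)
           for (i, j), c in zip(warp_path, local_cost) if c > threshold]
print(f"{len(flagged)} frames flagged as possible mismatches")
```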
Content-based music distribution. Transcription features and transcription outputs are nowadays used in most commercially available systems for music similarity and music recommendation. I'm talking about everyday uses in services such as Spotify, where they use pitch features as a way to perform more robust similarity measures compared to just using the raw audio. In musicology, there has been some use of transcription, for example for detecting repeated patterns in polyphonic music from audio, instead of doing that at the score level. Along with a colleague of mine, Dan Tidhar, who is a harpsichordist, we have been doing a lot of work on estimating harpsichord inharmonicity and temperament. The idea is that back in the good old days, the 17th century, people didn't tune their instruments using what we nowadays call equal temperament, which is the standard tuning system used pretty much everywhere today. They were tuning their instruments using different tuning configurations, or temperaments, perhaps according to the directions of a specific composition; and different tuning configurations, different temperaments, correspond to possibly different moods or emotions that the composer would like to convey. This requires very high frequency precision in order to determine those tunings, and we can do so using transcription, and we can study the evolution of tunings over the years for different recordings. More into physics and musical acoustics, we can also use transcription as a way to estimate parameters of instruments, to perform instrument modelling. François Rigaud was able to use transcription as a way to model the inharmonicity of a piano. The piano, as mentioned before, is inharmonic, and each pitch is differently inharmonic compared to the other pitches; the higher pitches are generally more inharmonic than the lower ones. Each string in the piano is separate, so it abides by its own inharmonicity, and we can use transcription in order to model that. Another way of using transcription: we can use the raw output of transcription, which is essentially a non-binary piano roll, for visualising a music performance, and we can use that also to analyse music performance. A lot of contemporary music does not really rely on Western staff notation or the existence of a rigid score; it relies more on non-binary, continuous aspects, on new ways of representing timbre and pitch, and we can use transcription in order to do so. That was it about related problems. Any questions so far? Another brief section about software.
Again, because of this tutorial, I wanted to share with you some resources so that you can easily implement something. This list is not exhaustive, but I think it's fairly representative of the systems which are publicly available right now across different languages. This includes the Onsets and Frames one, which I think is partially available, in the sense that the training data is not fully available, but the pre-trained model is there and the basic neural network architecture is also there. And there is also quite a lot of commercial software just for transcription; this is also a partial list of some of the software I'm familiar with. Unfortunately, the demo is not going to happen because of the AV issues that we have. However, you're very welcome to try out Sonic Visualiser, which is cross-platform software released by Queen Mary University of London for visualising and extracting information from music and sound. It's not only for transcription: people over the years, PhD students and post-docs, have been developing different plug-ins for different functionalities. Myself, I have been developing a plug-in called Silvet, which is for transcription. There are also plug-ins for source separation, alignment, chord recognition, key estimation, beat detection, and so on.

And that brings me to the final section of the tutorial, which is on a few more advanced topics for transcription. I want to go through some more specialised cases, also reflecting some of the research done here at NUS: transcribing percussive instruments, singing voice, and non-Western music, using some more advanced techniques, such as language models, for improving transcription, creating systems that produce what we call a complete transcription into staff notation, and further research on evaluation.

So, for drums, or percussive instruments, something like this maybe. We usually have kick drums, snare drums, hi-hats, cymbals, toms. This has developed a bit separately from what I've been talking about so far, into the area of automatic drum transcription. The task here mostly refers to transcribing solo drums, or to removing percussive sounds in order to transcribe the pitched sounds, and less so to transcribing both drums and pitched sounds together, which is a relatively unexplored area. Drums operate a bit differently compared to the kinds of sounds I've been showing before. Percussive sounds in general do not exhibit clear harmonics; they might not even have any harmonics at all. Usually you see activity almost across the entire frequency spectrum, activation in all frequencies. Depending on the drum type, you might observe some characteristic frequencies, maybe in a broad range. Some drums, some percussion, can be tuned, some less so. In percussion we are generally not interested in the offset; we are interested in the onset, the start time. So often in drum transcription systems what you see are these vertical lines, which denote the start of a specific hit, and not its end time. And over the years, people have been working with methods similar to the ones I've presented before, using NMF for example, including NMF in a real-time manner as well.
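As a rough sketch of the NMF-style approach just mentioned, the following decomposes a magnitude spectrogram into a handful of percussive templates and their activations, and peak-picks the activations as hit onsets; the file name, the number of components, and the peak-picking settings are all illustrative assumptions.

```python
import numpy as np
import librosa
from sklearn.decomposition import NMF

# Hypothetical solo-drum recording.
y, sr = librosa.load("drum_loop.wav", sr=22050)
S = np.abs(librosa.stft(y, n_fft=2048, hop_length=512))

# Decompose S ~= W @ H: W holds spectral templates (e.g. kick/snare/hi-hat),
# H holds their activations over time. Three components is an assumption.
nmf = NMF(n_components=3, beta_loss="kullback-leibler", solver="mu",
          init="nndsvda", max_iter=300, random_state=0)
W = nmf.fit_transform(S)          # (n_freq_bins, n_components)
H = nmf.components_               # (n_components, n_frames)

# Peak-pick each activation row to obtain hit times per template.
for k, activation in enumerate(H):
    activation = activation / (activation.max() + 1e-9)
    peaks = librosa.util.peak_pick(activation, pre_max=3, post_max=3,
                                   pre_avg=5, post_avg=5, delta=0.2, wait=5)
    times = librosa.frames_to_time(peaks, sr=sr, hop_length=512)
    print(f"template {k}: hits at {np.round(times, 3)} s")
```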
In one piece of work I was doing a few years back with Sebastian Ewert, we were trying to come up with a system that was able to transcribe both drums and pitched sounds; at the time, the application was jazz music. More recently, of course, neural networks came to the forefront and people started using recurrent neural networks for the task, the reason being that recurrent networks are relatively successful at modelling time series data. A more recent comparison between NMF and neural networks for drum transcription showed that, yes, neural networks do outperform NMF, but maybe less so in the case of more complex datasets: with unseen data, where you don't know the specifics of the drum kit, or when you don't have access to the specific recording or acoustic conditions, you see less of a difference between the two methods. Overall, performance for drum transcription is good. The problem, however, is a bit more limited, in the sense that you only need to identify the drum type: kick drum, snare drum, hi-hat, toms, cymbals, things like that. You don't need to care about pitch. However, for drum applications a greater degree of temporal accuracy is often needed compared to pitched sounds, and very often there is also a need for real-time processing, depending on the application; current methods are not that good with temporal precision, as of yet anyway. Another open problem is still coming up with good ways to transcribe both drums and pitched sounds at the same time, which is quite commonly found in contemporary music, pop-rock, jazz, etc.

Singing. That's the second topic. So far I've been talking only about musical instruments; however, all popular music cultures around the world use singing. The singing voice is immensely expressive, much more so than most musical instruments, and common representations that we use, such as the MIDI format, are totally inadequate for representing singing. There are all sorts of challenges: different phonation modes (voiceless, breathy, pressed), different singing styles (choral, pop, theatrical, overtone singing, and so on). And we have other challenges. We have intonation. What is intonation? It's the ability, or inability, of people to sing in pitch. Drift: when we start singing, at some point we start drifting from our reference pitch towards some other pitch. And of course poor singing, which kind of happens; you can watch YouTube for that. So what do we do in this case? We usually have a very fuzzy output in terms of pitch when we're talking about the singing voice, which would need to be quantised, maybe onto a MIDI scale, a semitone scale, something like that, and temporally segmented. However, by doing so we lose all that interesting information about the vocal performance. Nowadays, in the context of our research in music information retrieval, singing voice is mostly addressed through the problem of monophonic pitch detection, which could be considered to be solved. State-of-the-art methods include Melodia, the pYIN algorithm, and more recently the CREPE algorithm; the latter is one of these end-to-end convolutional neural networks. And here is a link to Tony, another piece of software that has been developed at Queen Mary, which is meant for transcribing and analysing the singing voice.
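For reference, here is a minimal sketch of monophonic pitch tracking with the pYIN implementation in librosa, roughly the kind of frame-level output that the note quantisation described above would start from; the file name and the pitch range are placeholder assumptions.

```python
import numpy as np
import librosa

# Hypothetical solo-singing recording.
y, sr = librosa.load("vocal_take.wav", sr=22050)

# pYIN frame-level pitch tracking; the range roughly covers a typical singing voice.
f0, voiced_flag, voiced_prob = librosa.pyin(
    y, sr=sr,
    fmin=librosa.note_to_hz("C2"),
    fmax=librosa.note_to_hz("C6"),
)
times = librosa.times_like(f0, sr=sr)

# Crude note quantisation: snap voiced frames to the nearest MIDI semitone.
# This is exactly the step where expressive detail (vibrato, drift) is lost.
midi = np.where(voiced_flag, np.round(librosa.hz_to_midi(f0)), np.nan)
print(np.c_[times[:10], midi[:10]])
```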
The challenges are that we have not yet been fully successful at tracking pitch over time for the singing voice. We have not really addressed a lot of different vocal techniques. We have not really addressed the problem of pitch shift. Current methods are simply not successful at transcribing both singing and instrumental accompaniment. Recently, with my colleague Rodrigo, we have been trying to see what happens when transcribing music with multiple singers, such as this one.

Speaker 2: We are trying to do some experiments with chorales and barbershop music.

Speaker 1: We are trying to identify the pitches as well as assign each pitch to a specific singer: soprano, alto, tenor, and so on. But there is still quite a lot of research remaining to be done on transcribing singing in the non-monophonic case, where you have either accompaniment or multiple people singing at the same time. And the other aspect of singing is that, of course, it also contains spoken words; it contains lyrics. This is a slightly separate problem, which we call lyrics transcription. In this case, you have an audio recording as input and you want to recognise the lyrics automatically. In order to do so, a good approach might be to first detect the phonemes, in the same way that we do for speech recognition. Most research on lyrics transcription is in fact based on the way that we do automatic speech recognition, very much so also because there is a lack of annotated data for lyrics transcription, so we have to rely on pre-trained models for speech recognition, maybe slightly adapted for singing. Another reason for transcribing lyrics is that vowel types can be important for distinguishing the colours of singing voices as well. Usually the systems for lyrics transcription look something like this, maybe the non-neural-network ones: we have an audio recording, we extract a spectrogram, we extract some features, and we have an acoustic model; and then we have a language model. The language model tries to predict a natural sequence of phonemes and words, so that it makes sense in a language, in terms of grammar and syntax for instance. So from the spectrogram we detect the phonemes, based on the phonemes we assign words, and then complete sentences. However, as mentioned before, there is a lack of annotated data; there is the challenge of trying to transcribe the singing voice in the presence of instrumental accompaniment; and, as Wang Li has been doing here, there is work on mispronunciation as well. You cannot assume when transcribing the lyrics that the pronunciation is going to be Oxford-style pronunciation or something like that. Things get different in real life.
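To make the acoustic-model-plus-language-model idea concrete, here is a toy sketch that decodes a phoneme sequence from a (randomly generated) posteriorgram with a simple bigram transition prior via Viterbi; the phoneme inventory, posteriors, and probabilities are made up purely for illustration and do not represent any specific lyrics transcription system.

```python
import numpy as np

rng = np.random.default_rng(0)
phonemes = ["sil", "ah", "b", "iy", "s"]          # toy phoneme inventory
n_states, n_frames = len(phonemes), 40

# Stand-in for acoustic-model output: per-frame phoneme posteriors.
acoustic_post = rng.dirichlet(np.ones(n_states), size=n_frames)   # (T, S)

# Toy bigram "language model" over phonemes: favour staying in the same
# phoneme, mildly penalise jumps. A real system would learn this from lyrics text.
bigram = np.full((n_states, n_states), 0.05)
np.fill_diagonal(bigram, 0.8)
bigram /= bigram.sum(axis=1, keepdims=True)

# Viterbi decoding in log space, combining acoustic scores and the bigram prior.
log_a = np.log(acoustic_post + 1e-12)
log_b = np.log(bigram + 1e-12)
delta = np.zeros((n_frames, n_states))
back = np.zeros((n_frames, n_states), dtype=int)
delta[0] = log_a[0]
for t in range(1, n_frames):
    scores = delta[t - 1][:, None] + log_b      # (previous state, next state)
    back[t] = scores.argmax(axis=0)
    delta[t] = scores.max(axis=0) + log_a[t]

path = [int(delta[-1].argmax())]
for t in range(n_frames - 1, 0, -1):
    path.append(int(back[t][path[-1]]))
path.reverse()

print("decoded:", [phonemes[s] for s in path])
```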

Speaker 3: I think language models would be pretty hard for singing as well because lyrics tend to be more like poetry rather than...

Speaker 1: That's a classic one, yeah, that's totally true. I remember, last year I think it was, someone tried to create a system for lyrics transcription, and they trained their language model on the Reuters news corpus.

Speaker 4: But to address this question, we could probably use the lyrics corpus alone to do the training.

Speaker 3: What do you think? You could. That would certainly help. A lot of lyrics are very, very strange. That's what makes some good lyrics. They're unusual combinations of words.

Speaker 1: I'm going to go into a bit more detail on language models later on. But my personal view, maybe, is that in music, each composition might essentially have its own language in a way, both in terms of the lyrics and the pitches present. There are commonalities and music theory rules and all that, but relying more on the composition itself, the piece itself, might be a more useful cue, and exploiting repetitions within the piece might work better than relying on a language model that might not reflect these unpredictable lyrics, for instance. Personal view, again.

So, non-Western music, the third subtopic here for advanced transcription. The vast majority of research, both in music information research generally and in music transcription specifically, is, maybe unfortunately, focused a lot on Western music, or what some people might call Eurogenetic music. It generally assumes 12-tone equal temperament: we take the frequency space of one octave and split it into 12 semitones, and in the case of 12-tone equal temperament we divide the octave into 12 equal semitones. However, this doesn't really hold across music cultures around the world, which might have different ways of configuring pitch. For example, cultures in the Middle East using the maqam generally divide an octave into 53 microtones, and this poses quite a lot of problems for how we develop systems that are able to detect pitch at such a fine scale. Another assumption is that of monophony or polyphony; again, by monophony I refer to having one note at a time, and by polyphony to having multiple notes at a time. However, in several music cultures, whether Indian music, music from China, or music from the Middle East, we have heterophony: one main melody which can be interpreted by multiple instruments in different ways, possibly, but not necessarily, in different octaves. Another assumption is that we usually rely on major or minor modes, whereas in fact in Indian music we have tens of ragas, and in Turkish and Middle Eastern music we have tens of makams, so different modal systems which are not really reflected by the kind of research we are doing. Sporadically, over the past few years, there has been some research bringing more world, folk, or traditional music to the forefront of music information research: on flamenco, on Turkish makam, on Chinese opera, on Irish fiddle music. Some examples, limited though. Increasingly, the computational ethnomusicology community is picking up on using some of these methods for world music analysis.

One example is work I have been doing with André Holzapfel, who is an ethnomusicologist. We were trying to create a system for transcribing Turkish microtonal music that would operate at a finer resolution than the 12-tone one; in this case, the system was able to transcribe pitch at a 20-cent resolution. What is a cent? A cent is a subdivision of a semitone, and one semitone consists of 100 cents, so 20 cents is essentially one fifth of a semitone in terms of resolution. This is the kind of recording that we are trying to transcribe. The tuning configuration is based on Turkish music theory, which assumes 53 different microtones, and each mode uses only 12 of those microtones in order to come up with a composition. The output here is not a piano roll; it's a much finer representation on a 20-cent scale.
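As a small illustration of working at 20-cent resolution, here is a sketch that maps detected frequencies onto a 20-cent grid relative to an assumed reference; the reference frequency and the example values are arbitrary choices for illustration.

```python
import numpy as np

A4 = 440.0          # assumed reference frequency
CENT_STEP = 20.0    # resolution of the pitch grid, in cents

def to_cent_bins(freqs_hz, ref_hz=A4, step=CENT_STEP):
    """Map frequencies to bins on a `step`-cent grid centred on `ref_hz`."""
    cents = 1200.0 * np.log2(np.asarray(freqs_hz, dtype=float) / ref_hz)
    return np.round(cents / step).astype(int)

# Example: a pitch 35 cents above A4 lands two 20-cent bins up,
# whereas a 12-tone system would simply snap it to A4.
freqs = np.array([440.0, 440.0 * 2 ** (35 / 1200), 466.16])
bins = to_cent_bins(freqs)
print(bins)                      # e.g. [0 2 5]
print(bins * CENT_STEP, "cents above A4")
```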
So we had to come up with approaches to address the problem of high-resolution pitch detection. More broadly on world and traditional music, Queen Mary, UCL, and City University have been collaborating on transcribing recordings from the British Library's world and traditional music collections, providing those transcriptions to ethnomusicologists and making them available. Still, though, the British Library only has access to recordings from a few countries. I don't know if you can see this clearly, but the grey countries are the ones we don't have access to; we mostly only have access to the UK, Ireland, and former colonies, or the Commonwealth countries as they might be called. And there is a lot to be done in the area of transcribing non-Western music: getting access to data, which is really tough, and specifically to annotations; and deciding whether we need to focus more on what musicologists would call music universals, so universal representations of music, versus culture-specific transcriptions. Different cultures have their own music notation systems, so do we come up with systems that transcribe using a specific notation system, or do we try to come up with a general-purpose solution to suit different cultures, maybe a more abstract representation of pitch and time? There is also the problem of prescriptive versus descriptive notation, in the sense that usually in world, folk, and traditional music there is no score; we are able to produce a descriptive notation which describes the performance, but we are not yet able to properly come up with a prescriptive notation, a recipe for reproducing that performance, let's say. And there is ongoing engagement with the world music community; the Folk Music Analysis conference, for example, is aimed at bringing together computer scientists and ethnomusicologists in this area.

Another topic I wanted to briefly touch upon is the use of language models. Most approaches I've shown you so far were purely audio-driven: we just have the audio, we have the model, and we hope it's going to work. The errors that these systems make are often not musically meaningful, and they could be avoided if we integrated some musical knowledge, maybe from harmony, from key, from music structure, maybe from composition rules. This is really the way that speech recognition works: systems for speech recognition do rely on the audio signal, but they also have a language model behind them, which helps with word disambiguation and helps predict the next word to come. The problem, of course, is that music is a bit more complex than speech, at least in the sense of having many concurrent events happening. We also have quite long temporal dependencies in rhythm, and we have quite a lot of music theory that is difficult to implement in a computer system in order to perform such an analysis. Some approaches have been made using probability theory: we have some observations, which is the audio, we link them to notes, and we can link those to chords; notes are linked together over time, and chords are linked together over time. But the improvements shown were not that great, because these kinds of models do not rely on specific context in terms of music style, and music composition might depend a bit on the composer as well.
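As a toy sketch of that idea of linking note states over time, here is a two-state (note off/on) hidden Markov model that smooths a single pitch's frame-level posteriors, a simple probabilistic alternative to frame-wise thresholding; the posteriors and transition probabilities are made up for illustration only.

```python
import numpy as np

# Stand-in for an acoustic model's per-frame posterior that one pitch is active.
posterior = np.array([0.1, 0.2, 0.8, 0.4, 0.9, 0.85, 0.3, 0.6, 0.2, 0.1])

# Two hidden states: 0 = note off, 1 = note on. Sticky self-transitions encode
# the prior that notes persist, discouraging single-frame flicker.
A = np.array([[0.9, 0.1],
              [0.1, 0.9]])
emit = np.stack([1.0 - posterior, posterior], axis=1)      # (T, 2)
T = len(posterior)

# Forward pass (with per-frame normalisation for numerical stability).
alpha = np.zeros((T, 2))
alpha[0] = 0.5 * emit[0]
alpha[0] /= alpha[0].sum()
for t in range(1, T):
    alpha[t] = emit[t] * (alpha[t - 1] @ A)
    alpha[t] /= alpha[t].sum()

# Backward pass.
beta = np.ones((T, 2))
for t in range(T - 2, -1, -1):
    beta[t] = A @ (emit[t + 1] * beta[t + 1])
    beta[t] /= beta[t].sum()

# Smoothed state posteriors and a smoothed binary piano-roll row.
gamma = alpha * beta
gamma /= gamma.sum(axis=1, keepdims=True)
print("thresholded :", (posterior >= 0.5).astype(int))
print("HMM-smoothed:", (gamma[:, 1] >= 0.5).astype(int))
```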
More recently, with neural networks, there has been a big boom in using language models, because before, with more traditional methods, we were not really able to easily model multiple concurrent notes over time; this has, I would say, been overcome by neural network approaches. This is an example of a language model, the so-called RNN-RBM, the Recurrent Neural Network Restricted Boltzmann Machine, which has been successfully used as a language model for polyphonic music prediction. More recently, at Queen Mary, we were also able to combine a language model with an acoustic model to improve piano transcription performance. We had the acoustic model, which was a neural network, and another neural network which was the language model, and we were able to integrate them in a probabilistic framework in order to produce a more musically meaningful output and to avoid some common errors made by transcription systems.

And the next-to-last thing: so far I've talked about methods that produce pitch estimates in terms of frames, or notes with a start and an end time, but I haven't talked about coming up with something like this, a score in staff notation, what I would call a complete transcription. Up to a point, current systems can detect multiple pitches, they can detect their timings, they might detect instruments and assign notes to a specific instrument. Some other systems might be able to infer rhythmic information, detect the tuning, extract some dynamics or fingering, but work still needs to be done to combine everything and produce a final score in staff notation. There are quite a few publicly available tools where you give them a performance MIDI file as input and they output a score in staff notation: Sibelius, MuseScore, ScoreCloud, and so on. However, there is little evidence on what happens if we put in as input the noisy output of a transcription system: is the resulting score going to be good enough or not? There has been no research on that. Oh, what was that?

Speaker 3: Okay.

Speaker 5: Again.

Speaker 4: Try to click it, it's not recording. That's it.

Speaker 2: Okay. Okay.

Speaker 1: That works. Okay, almost there. So, more recently, with Kyoto University, we were trying to come up with a system that was able to produce a transcription in staff notation. Using an input such as this, we were able to produce a temporally quantised version that could be typeset into a score. So, still some mistakes, but at least the timings are appropriate. And even more recently, there has been some work on what we would call end-to-end complete transcription: coming up with one single system that goes directly from an input waveform to a final score in staff notation, using the LilyPond notation, which is one common machine-readable notation system. So far it is good enough for monophonic transcription, but not quite there yet for the polyphonic case; but progress is being made in this field.
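As a toy illustration of the temporal quantisation step just mentioned, here is a sketch that snaps detected note onsets to a sixteenth-note grid under the assumption of a known, constant tempo; real systems estimate tempo and metre jointly rather than assuming them.

```python
import numpy as np

TEMPO_BPM = 100.0                 # assumed constant tempo
GRID_PER_BEAT = 4                 # sixteenth-note grid

def quantize_onsets(onsets_sec, tempo_bpm=TEMPO_BPM, grid_per_beat=GRID_PER_BEAT):
    """Snap onset times (seconds) to the nearest grid position, returned in beats."""
    beat_dur = 60.0 / tempo_bpm
    grid_dur = beat_dur / grid_per_beat
    onsets = np.asarray(onsets_sec, dtype=float)
    grid_positions = np.round(onsets / grid_dur)           # index on the grid
    return grid_positions / grid_per_beat                  # position in beats

# Noisy onsets from a hypothetical transcription (seconds).
detected = [0.02, 0.61, 0.89, 1.22, 1.48]
print(quantize_onsets(detected))   # e.g. [0.  1.  1.5 2.  2.5] beats
```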
Final thing: evaluation. As mentioned before, some errors are more important than others. Some notes in this performance could be deemed more musically important than others, and their correct or incorrect detection could matter more than the incorrect detection of some other notes in the accompaniment here. So some errors can be more musically annoying. Usually, harmonically related errors such as octave errors are less annoying than, let's say, a misidentified note a semitone away. Wrong notes outside the scale are usually more annoying than mistakes within the scale. The annoyance, however, depends on the actual application. And when we play back a transcription, missing some notes is less annoying than having extra notes in the transcription. Currently, I'm trying to run a study with musicologists on evaluating the quality of the errors that these transcription systems make, and I have some ideas about how to design musically meaningful measures which still need to be implemented. We could analyse how music teachers grade music dictation in order to get some inspiration, even though those errors might not be representative of the errors made by computers. We can also run subjective listening tests with different types of errors generated by systems and ask people to evaluate them, but it's quite difficult to find qualified people to do so: the study I'm running right now has been going for one year, and we've found, I think, eight qualified subjects who are proficient in music transcription and are willing to give their time.

And that was it. I hope I was able to convince you of the merits of music transcription. It still continues to attract interest in the music information retrieval community, and, maybe interestingly, if you go to the big machine learning conferences you might also encounter papers devoted to it. The performance of such systems has increased over the last decade, but it is still quite far from what might be considered satisfactory. However, if you constrain your application to a specific model or a specific music style, then you can get decent performance. And in any case, even noisy features, noisy outputs from transcription, can still be useful for higher-level applications, such as chord recognition, key detection, music similarity and recommendation, and so on. As the scope of this research increases, the number of applications and programs increases as well. For this tutorial, my colleagues and I were able to identify about 300 papers from the last 10 years on transcription, so it's really a big field. We have generally agreed that a successful system cannot rely only on the audio signal; we need to take cues from other disciplines, from acoustics, from music theory, and from music perception and cognition. As of yet, there is no unified approach, no consensus on what a good and appropriate methodology for the task should be, and this remains to be seen. Thank you very much for listening. Do you have any questions? Yeah, questions.

Speaker 6: So much of the research we've seen today is based in classical contexts; we've seen all of them talking about traditional instruments. But we all know that more recent music makes very heavy use of synthesizers and all this signal processing, so I'm sure this has a very acute impact on the frequencies and all of those features. Is there any research trying to adapt to more recent music?

Speaker 1: Yeah, I mean, the problem with contemporary music is that you have a lot of distortion and a lot of effects, which makes things difficult. At the same time, you might not want a proper transcription of those pieces; for pop-rock, usually what you get is a lead sheet transcription. Sony, for example, has been working a lot on that, coming up with transcriptions of the melody and the bassline, for instance, which could be enough, so you don't need to look at everything.

Speaker 3: Sam? Yeah, just for a comment, I was going to say that you're saying that if the transcription system is going to make mistakes, it's nicer if the notes make musical sense. In some ways, if you're going to have transcriptions that need human cleaning up anyway, it's probably better to have errors that stand out more. Makes sense. Yeah, I agree with that. Aim for things that are easy to correct.

Speaker 1: Yeah, if you want to correct the errors, then yes, it might make sense to have the out-of-key semitone errors rather than the harmonic ones; but if you just want to get the transcription out without any correction, then maybe you want a language model which would clean up the messy bits.

Speaker 3: It's hard to see what the purpose of just getting a transcription and playing it back would be anyway.

Speaker 1: There are actually applications for this. IRCAM, over the past few years, have been successfully using transcription systems in their performances for automatic accompaniment purposes, for instance. Another thing is that when using transcription for a downstream task, let's say chord estimation, fixing up some of these semitone errors would give you a more reliable high-level estimate.

Speaker 5: Any questions? Any other questions?

Speaker 1: They're really separate at the moment; the problems of lyrics transcription and pitch estimation have been quite distinct so far, and there has only been a bit of progress, I would say, in the past year on trying to create an integrated system that addresses both. Yeah?

Speaker 5: So is it really helpful to incorporate the pitch information into the phoneme modelling for singing?

Speaker 1: Yeah, it's kind of different, but I mean, it's really tough to transcribe the singing voice if you don't know the pitch, the exact pitch, because the pitch fluctuates quite a lot. If you have information on the pitch, I think it would be much easier to identify the lyrics afterwards. That's my view, I think.

Speaker 4: Okay? Any more questions? If not, let's put our hands together and thank Emmanouil for an excellent discussion.
