Speaker 1: So, I'm from Queen Mary University of London, and the main aim of this group is to apply and develop digital technologies for analyzing both music and audio. It's a fairly large group of more than 100 people, including staff and research students, and probably the most famous output of this group is Sonic Visualiser, a tool for visualizing and extracting information from audio, which has also been used quite a lot in computational musicology applications. The group is broadly interdisciplinary: we are based in a school of computer science and electronic engineering, but it attracts people from music performance and musicology backgrounds, as well as mathematicians, physicists, engineers, and so on. Myself, I'm based more in the music informatics and machine listening field, along with Bob, who's also present today.

I'm going to briefly define what I call automatic music transcription, which you might yourselves call audio transcription or digital transcription, as I heard yesterday. Then I will mention some specific studies on using these technologies to analyze Turkish makam music and Cretan dance tunes, and a larger project using the British Library Sound Archive. I would define automatic music transcription as the process of converting an acoustic music signal into some form of musical notation, which can be either machine readable or human readable, depending on your context and application. It is a fairly difficult problem, especially in the case of what we computer scientists would call polyphonic music; by that I simply mean many notes sounding concurrently, not polyphony in the strict musicological sense. It is especially hard when many instruments play at the same time. There are quite a lot of applications for this problem: in music informatics itself, for organizing and navigating music collections, but also in music production and in interactive music systems, including automatic accompaniment, as Chrysoula mentioned in the previous talk, and of course in computational musicology and ethnomusicology. The task can be divided into several smaller problems, including pitch detection or multi-pitch detection, instrument identification, onset detection, derivation of rhythmic and metrical information, and finally engraving and typesetting, putting everything together into an actual, perhaps human-readable, score.

So that's the problem, but what about using these methods for world music collections? The problem that we are facing, and that many methods in MIR have been facing, is the so-called Western bias. Many of these methods are developed using data that are predominantly Western in some way, and we also make assumptions when developing our computational or mathematical models: perhaps 12-tone equal temperament, perhaps the presence of specific timbres, for example Western orchestral instruments, and so on. I like the term Eurogenetic, used by some of my colleagues based at Boğaziçi University in Istanbul, as an alternative to what we call Western music. The bias also appears in how we evaluate these methods: in the MIR field we usually evaluate with respect to a semitone scale, allowing only a small deviation from the ideal so-called tuning, and that assumption does not hold for much of the world's music.
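To make that evaluation assumption concrete, here is a minimal sketch (illustrative only, not the evaluation code used in any of the studies below) of how a typical MIR note-matching criterion compares pitches in cents with a half-semitone tolerance:

```python
# Minimal sketch: how a typical MIR note-matching evaluation compares a
# detected pitch to a reference pitch in cents, with the usual +/-50-cent
# (half-semitone) tolerance that implicitly assumes 12-tone equal temperament.
import math

def hz_to_cents(f_hz: float, ref_hz: float = 440.0) -> float:
    """Distance of f_hz from ref_hz in cents (100 cents = 1 semitone)."""
    return 1200.0 * math.log2(f_hz / ref_hz)

def note_matches(detected_hz: float, reference_hz: float,
                 tolerance_cents: float = 50.0) -> bool:
    """Standard semitone-scale criterion; too coarse for microtonal repertoires."""
    return abs(hz_to_cents(detected_hz, reference_hz)) <= tolerance_cents

# A pitch 45 cents away (nearly a quarter tone) still "matches" under this
# criterion, even though in makam music it may be a distinct scale degree.
print(note_matches(detected_hz=440.0 * 2 ** (45 / 1200), reference_hz=440.0))  # True
```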
There are also some specific challenges when analyzing world music collections that you don't really find in so-called Eurogenetic music, for example heterophony: many instruments interpreting the same melody, perhaps in different ways, perhaps in different octaves, and we need to address this. Microtonality is another: you heard earlier the performance on the quarter-tone piano, and this is fairly common in world music. There is also the presence of different timbres and different instruments, and the lack of annotated data with which to train our methods, which is a big problem. Usually, when you talk about automatic music transcription in the MIR community, what people expect is a figure something like this: a piano-roll representation of pitch over time, with pitches usually on a semitone scale. That's not good enough when talking about world music.

So, a few case studies of applying and developing new transcription technologies for specific music cultures. The first one was with a colleague of mine, Andre Holzapfel, who is now at KTH in Sweden. Andre was based in Istanbul at the time, and we were investigating the possibility of applying these automatic transcription methods to Turkish makam music. The main motivation was that there is a huge population, maybe approximately the size of the EU, that listens to musics based on modal concepts similar to Turkish makam music; that includes music from Iran, for example. There are quite a few interesting studies you can make afterwards about how to use automatic transcription of Turkish makam music to study improvisations and differences between performers, and by having an automatic transcription as an intermediate step, you can also explore some interesting tools available in the realm of symbolic music analysis, or symbolic MIR. It is interesting to note that in the context of Turkish music there is what we might call an enhanced Western staff notation, with a few extra accidentals to indicate microtonal intervals. So the challenge we were facing was, basically: can we extend existing Western automatic transcription systems to transcribe Turkish makam music?

In order to do that, we had to spend quite a lot of time annotating data using resources from the CompMusic project, a fairly large ERC project led by Universitat Pompeu Fabra in Barcelona, focusing on MIR for world music. We had access to a series of recordings of Turkish makam music, along with reference scores that were not aligned to the performances. We spent a lot of time using a semi-supervised approach to align the notation, which was in a specific format called SymbTr, essentially a MIDI-like representation extended for Turkish makam music, to the audio, and also to the tonic, which is a key concept in makam music theory. So we had this nice collection, and we used it to create and evaluate a system that is able to transcribe these recordings into this SymbTr symbolic notation. The system takes an audio recording as input, and the model we developed is based on a dictionary of note templates.
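As a rough illustration of the dictionary idea (and only an illustration; the actual makam-informed model described in the talk is more sophisticated), here is a generic sketch in which fixed spectral note templates are matched against a spectrogram to estimate when each template is active:

```python
# Toy sketch of a dictionary-based decomposition: given fixed spectral note
# templates (one column per note/instrument), estimate how strongly each
# template is active in every frame of a spectrogram. This is a generic
# NMF-style decomposition with fixed templates, for illustration only; the
# system described in the talk is makam-informed and works on a finer,
# tonic-relative pitch grid.
import numpy as np

def template_activations(spectrogram: np.ndarray, templates: np.ndarray,
                         n_iter: int = 100) -> np.ndarray:
    """spectrogram: (freq_bins, frames); templates: (freq_bins, n_templates).
    Returns non-negative activations of shape (n_templates, frames)."""
    rng = np.random.default_rng(0)
    activations = rng.random((templates.shape[1], spectrogram.shape[1]))
    for _ in range(n_iter):
        # Multiplicative update minimizing KL divergence, templates held fixed.
        approx = templates @ activations + 1e-12
        activations *= (templates.T @ (spectrogram / approx)) / (
            templates.T.sum(axis=1, keepdims=True) + 1e-12)
    return activations
```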
These note templates were extracted, quite laboriously, from recordings of several different Turkish instruments. We had the ney, a reed instrument, and the tanbur, a plucked string instrument, and we also had percussion. We isolated note samples, annotated them, and put them all together in a dictionary, and we created a system that is so-called makam-informed: the user supplies an audio recording, the user also supplies the makam, so the main mode of the piece, and the output is a series of notes identified by a start point, an end point, and the pitch relative to the tonic, in 20-cent resolution (note that resolution). We also had another process, because many of the pieces we were transcribing were heterophonic, so we had to come up with a way to reduce the output to a single melodic line, even though it might be coming from several instruments interpreting the melody in different octaves. In the end we had a system that is fairly accurate for transcribing Turkish makam music and can be used as a basis for manual correction, in order to come up with proper, usable scores. We were also able to study a bit how actual performance practice can deviate from the theoretical tunings implied by the notation. We did face a lot of challenges, including, of course, heterophony, but also the presence of percussion; some of these percussion instruments are also pitched, as we heard in some talks yesterday, and that adds to the problem. This figure shows the output in 20-cent resolution for a specific recording. I'm going to try to play it now, hopefully it works. No. Okay. Let's try again. Let's see if it works now. So I'm going to play now the automatic transcription of this recording, which is essentially a MIDI file. At least for this particular case, the system was able to detect most of the notes correctly, even though the recording was a bit noisy, but this was a fairly simple piece; there are more challenges when we're talking about ensemble pieces with many instruments and percussion. Moving on: this was a first attempt to create a transcription system for a specific non-Western music culture.
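The output representation described above, namely note events whose pitch is expressed relative to the tonic in 20-cent steps and which are folded onto a single line when several instruments double the melody in octaves, might be sketched along these lines (the field names here are illustrative, not the system's actual format):

```python
# Hedged sketch of a tonic-relative, 20-cent note-event representation and a
# naive way to fold octave-duplicated (heterophonic) detections onto one line.
import math
from dataclasses import dataclass

@dataclass
class NoteEvent:
    onset_s: float          # start time in seconds
    offset_s: float         # end time in seconds
    cents_above_tonic: int  # pitch relative to the tonic, quantized to 20-cent steps

def to_note_event(onset_s, offset_s, f0_hz, tonic_hz, step_cents=20):
    """Convert a detected fundamental frequency to a tonic-relative note event."""
    cents = 1200.0 * math.log2(f0_hz / tonic_hz)
    return NoteEvent(onset_s, offset_s, int(round(cents / step_cents)) * step_cents)

def fold_to_single_octave(events, octave_cents=1200):
    """Naive heterophony handling: map every note into one octave above the tonic."""
    return [NoteEvent(e.onset_s, e.offset_s, e.cents_above_tonic % octave_cents)
            for e in events]
```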
Tying in very nicely with Chrysoula's talk (we are both from Crete, by the way), we then decided to do another study, using these methods to transcribe music from the island of Crete. We had access to another set of recordings from a project originating in Greece, which was about creating a corpus of Cretan dance tunes. The corpus had audio recordings, but also reference scores transcribed by ethnomusicologists. We wanted to take the system a step further: not only to output a MIDI-like representation with just notes and start points, but also to try to create proper staff notation in the end. So we also needed automatic methods for extracting rhythmic and metrical information. We expanded this corpus by aligning the audio recordings we had with the reference scores, creating enhanced scores that contain metrical information as well as note timings. And we created a system that essentially does beat-informed multi-pitch detection: it detects the beat, but also automatically detects multiple pitches, in order to output a proper human-readable score. We selected a specific set of dance tunes, the sousta, which is a very common dance in Crete, and we created a small but very rich corpus of audio recordings, MusicXML staff notation, and MIDI files containing both note timings and metrical annotations. The scores provided by the ethnomusicologists looked something like this. Again, we created a system that takes an audio recording as input and performs multi-pitch detection, but now also beat tracking, in order to output proper staff notation. We also incorporated some tuning estimation, which was crucial for getting an accurate result.

The result looks something like this: the top two staves are the automatic transcription of the recording, and the bottom staves the manual transcription of the same recording by an ethnomusicologist. At first glance there are not a lot of commonalities, but if you look closely you will start recognizing some, as well as the different assumptions made by the machine listening system versus those made by the human. Again, I will try to play some of these recordings; I hope it works this time. No, again I have to unplug. So I'm going to play now the original recording. Now the synthesized transcription. And the two together. So again, there are mistakes; the method is not perfect by itself, but we are getting there, and that's a promising thing. In the future, we also want to investigate not only the objective difference between the automatic and the manual transcription, but also the perceptual difference between the two, which is something still missing from the MIR field.
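As a concrete illustration of one step in this pipeline, the tuning estimation mentioned above, here is a minimal sketch (illustrative only; the actual system's tuning estimation may work quite differently) that estimates a recording's global tuning offset from its detected pitches:

```python
# Illustrative sketch of a simple global tuning estimate: measure how far
# detected pitches sit, in cents, from an assumed A4 = 440 Hz semitone grid,
# and take the circular mean deviation as the recording's tuning offset.
import numpy as np

def estimate_tuning_offset_cents(f0_hz: np.ndarray) -> float:
    """Average deviation (in cents, range -50..+50) of pitches from a 440 Hz grid."""
    cents = 1200.0 * np.log2(f0_hz / 440.0)
    deviation = ((cents + 50.0) % 100.0) - 50.0   # distance to nearest semitone
    # Use a circular mean so values near +50 and -50 cents do not cancel out.
    angles = deviation / 50.0 * np.pi
    return float(np.angle(np.mean(np.exp(1j * angles))) / np.pi * 50.0)

# Example: pitches consistently about 20 cents sharp of the grid.
f0 = 440.0 * 2 ** ((np.array([0, 2, 4, 5, 7]) * 100 + 20) / 1200)
print(round(estimate_tuning_offset_cents(f0)))  # about 20
```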
The final thing I wanted to mention in this talk is an application of these methods to a larger corpus, done through the Digital Music Lab project, a fairly large project involving many institutions from the United Kingdom, along with the British Library and its sound archive. The main focus of this project was to enable digital musicology research on so-called big data, or large music collections, and to connect different corpora and data sets. So we tried to connect the sound archive from the British Library with some other collections, such as the CHARM database from the AHRC Research Centre for the History and Analysis of Recorded Music, and also to make some of these MIR tools more accessible to musicologists, without their having to code anything. And finally, to try to, let's say, not avoid exactly but circumvent some copyright issues, by carrying out the computations on-site at the institution hosting the archive, while sharing the derived features, and visualizations of those features, with interested researchers and musicologists. The project involved so-called classical music, but also world and traditional music, and for the latter we had access to over 29,000 recordings from the British Library world and traditional music collections. The recording dates span a wide range; the oldest ones, from the 19th century, are essentially wax cylinder recordings, which have been quite a big topic in this symposium, mostly of folk songs in English, Welsh, and Scottish Gaelic. There are also some historical recordings made across the former empire at the time. Another interesting thing is that we had access to very rich but really noisy metadata.

I think yesterday there was an example about having Yugoslavia listed as a country, which no longer exists, and this is a very relevant problem we had to face: when we're talking about geolocations, it is not simply a spatial issue but a spatio-temporal one. A PhD student at Queen Mary spent a lot of time annotating and curating these metadata. We don't have that much expertise at Queen Mary on metadata, so this is something that hopefully some of you could help us with. These collections are predominantly from Britain and the British Isles, but also from the so-called former empire, or Commonwealth: Uganda and India, for example, in addition to Britain and Ireland. In terms of the MIR methods used, we focused on various levels of what we would call descriptors, or features: low-level audio descriptors such as spectrograms or onsets, leading to what we might define as mid-level descriptors such as the notes, chords, or beats being present, and finally to higher-level concepts that might include temperament, instrumentation, or chord patterns. Specifically for automatic transcription, we had two settings: one was semitone-resolution transcription, but we also had 20-cent-resolution transcription, which was especially useful for the world music collections we had. These transcriptions were also used as a basis to compute other, higher-level descriptors such as pitch histograms, tuning, and similarity. In order to make all these outputs accessible to the wider community, there is a website from the Digital Music Lab where people can browse the collections that have been analyzed. The aim here is not so much to analyze individual recordings, as in the CREM presentation this morning, but to analyze and compare groups of recordings, larger collections; maybe larger-scale comparative musicology applications, let's say. We also had quite a lot of support from the Telemeta project in setting everything up.
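To illustrate the kind of collection-level comparison this enables, here is a hedged sketch (not the Digital Music Lab code) that builds a pitch histogram in 20-cent bins from each recording's transcription and compares the average profiles of two collections:

```python
# Sketch of a collection-level comparison: build an octave-folded pitch
# histogram in 20-cent bins per recording, average per collection, and compare
# collections with a simple distance. Purely illustrative.
import numpy as np

BINS_PER_OCTAVE = 60  # 1200 cents / 20-cent resolution

def pitch_class_histogram(cents_above_tonic: np.ndarray) -> np.ndarray:
    """Octave-folded, normalized histogram of note pitches (60 bins of 20 cents)."""
    bins = np.round(cents_above_tonic / 20.0).astype(int) % BINS_PER_OCTAVE
    hist = np.bincount(bins, minlength=BINS_PER_OCTAVE).astype(float)
    return hist / max(hist.sum(), 1.0)

def collection_profile(recordings: list) -> np.ndarray:
    """Average histogram over all recordings (each an array of cents above the tonic)."""
    return np.mean([pitch_class_histogram(r) for r in recordings], axis=0)

def histogram_distance(p: np.ndarray, q: np.ndarray) -> float:
    """L1 distance between two collection profiles (0 = identical)."""
    return float(np.abs(p - q).sum())
```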
So, a final thing: current work. Where are we right now? The work on the Cretan dance tunes was a first attempt at a system that can export staff notation, and we are now taking this further to also support polyphonic music, in collaboration with the University of Kyoto, to create a system that can output staff notation even in the case of complex polyphony. That is where we are right now, and the question is where to go next. I believe that these automatic transcription technologies can be really useful methods and tools to enable further research in digital ethnomusicology. However, there are challenges, both with respect to data availability (not only the audio itself but also annotated data) and with respect to the so-called Western bias that is still predominant in MIR research. An interesting challenge, which I think ties in quite nicely with this symposium, is a question that has been posed by some librarians now that we are moving closer towards publishing, towards creating actual staff notation. Before, under existing IP regulations, we could easily share derived features with the wider public: as long as those features could not lead back to the original audio, there was no problem with that. But when you are creating a system that can output proper staff notation, then from the perspective of a library this might be considered publishing. A score is no longer a derived feature; it is something more than that, and there are IP issues. Who owns the IP on that? Is it the composer, if one exists? Is it the performer from whose recording the transcription was made? Is it the person who wrote the software, or the people who provided the training data? I think Bob has also dealt with some similar issues around copyright attribution in these cases in the past. This is, I think, still an open problem that we will have to deal with in the future as these transcription approaches become more and more usable. And that's it. From these links you can also download some of these automatic transcription tools, if you want to try them out. Thank you very much.

Speaker 2: Wow, fantastic. And we're going to move towards the closing. Thank you very much. Thank you, appreciate it.