Speaker 1: Hello everyone, my name is Lu. Today I'm going to talk about automatic drum transcription using audio and visual analysis. Starting from general music transcription: it is a technique that transforms a music signal into music notation. Looking through the research, we find that most of it focuses on pitched instruments such as piano, violin, and cello, where the core problem for a transcription system is multi-pitch detection. Drums, by contrast, are representative of percussive sound transcription, which has received less attention, and that's why I'm focusing on them. So what are the benefits of such a system? First, it reveals the music's style and structure. Second, it captures rhythm and tempo information. In real-life cases, it helps musicians with composition, and it lets people record live performances for repeated use. The last point is something I found myself: since I recently started playing drums, I noticed that drum scores are hard to find for some popular songs, and this is where transcription can help. This is a basic drum kit. It consists of five categories: bass drum, snare drum, hi-hat, two cymbals, and three tom-toms, and these are the notations matching the drum set. As my survey focuses on the transcription process, the visualization part is not included. We identified three components of these systems: drum extraction, onset detection, and feature extraction. This gives a general view of what a music file looks like as a spectrogram. Starting with drum extraction, here we use the most popular approach, non-negative matrix factorization, which is well suited to analyzing the different components of polyphonic music. Here the V matrix represents the spectrogram of the whole music file.
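The factorization the speaker describes (V decomposed into spectral templates W and activations H) can be sketched with an off-the-shelf NMF implementation; the talk does not name a library, so scikit-learn and the toy random "spectrogram" below are assumptions for illustration only.

```python
import numpy as np
from sklearn.decomposition import NMF

# Toy "magnitude spectrogram" V: 6 frequency bins x 8 time frames.
# In a real system V would come from an STFT of the drum track.
rng = np.random.default_rng(0)
V = rng.random((6, 8))

# Factorize V ~= W @ H with 2 components:
#   W (6 x 2): magnitude spectrum template per component (e.g. kick vs. snare)
#   H (2 x 8): activation of each component over time
model = NMF(n_components=2, init="random", random_state=0, max_iter=500)
W = model.fit_transform(V)   # templates
H = model.components_        # activations

print(W.shape, H.shape)      # (6, 2) (2, 8)
```

Peaks along a row of H then indicate when the corresponding drum sounds, which feeds directly into the onset-detection stage described next.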
What we want is to decompose it into the product of two non-negative matrices, where the W matrix holds the magnitude spectra of the instrument components and the H matrix holds the activation of each component over time. To obtain a good decomposition, we introduce a cost function; when it reaches its minimum, we have the optimal decomposition. The next stage is onset detection, where an onset is the start of a sound event in the music. Let's go to the next slide to see how it works. Here is the music signal, and the red lines mark the onset positions; and in a separated instrument track, you can see the peaks at the onset positions. Now to the techniques inside this process. We have a detection function, which is the spectral flux minus a threshold. First, we use the short-time Fourier transform to move the time-domain sequence into the frequency domain. Next, the spectral flux aggregates the energy increase across all frequency bins. Then we define a dynamic threshold for picking the peaks, which correspond to onsets: wherever the detection function is greater than zero, we report an onset. The next stage is feature extraction. I don't focus much on this, but across the literature there are two main kinds of features. The first is temporal features, which you can see here: short-time energies, magnitude averages, and so on. The second is spectral features, such as spectral centroids and related descriptors. These are all calculated from the signals, so we won't go deeper. Most researchers focus on audio analysis, but nowadays there are many videos available, and in real life, the way people get a drum score is to transcribe it manually from these videos. So adding video analysis makes sense as a way to improve these systems.
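The onset-detection pipeline just described (STFT, spectral flux, dynamic threshold, then picking frames where the detection function is positive) can be sketched as below; the frame sizes, the moving-average threshold, and the toy noise-burst signal are illustrative assumptions, not the speaker's exact parameters.

```python
import numpy as np

def onset_positions(signal, frame=256, hop=128, window=8, bias=0.5):
    """Spectral-flux onset detection (a minimal sketch).

    1. STFT: magnitude spectrum per frame.
    2. Spectral flux: sum of positive magnitude increases across bins.
    3. Dynamic threshold: local mean of the flux plus a small bias.
    Frames where flux - threshold > 0 are reported as onsets.
    """
    # 1. Short-time Fourier transform (magnitudes only)
    n = (len(signal) - frame) // hop + 1
    mags = np.array([np.abs(np.fft.rfft(signal[i * hop:i * hop + frame]))
                     for i in range(n)])
    # 2. Spectral flux: half-wave rectified frame-to-frame difference,
    #    summed over all frequency bins
    diff = np.diff(mags, axis=0)
    flux = np.maximum(diff, 0).sum(axis=1)
    # 3. Dynamic threshold: moving average of the flux plus a bias
    thresh = np.convolve(flux, np.ones(window) / window, mode="same") + bias
    return np.where(flux - thresh > 0)[0] + 1  # frame indices of onsets

# Toy signal: silence with two noise bursts standing in for drum hits
rng = np.random.default_rng(0)
x = np.zeros(8000)
x[2000:2300] = rng.standard_normal(300)
x[5000:5300] = rng.standard_normal(300)
onsets = onset_positions(x)
print(onsets)
```

The two bursts start near frames 15 and 39 (sample index divided by the hop size), and the detection function fires there because the flux spike greatly exceeds its local average.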
As far as I found, there is not much research on this, and there are two general approaches. The diagram shows one approach, which performs drum-top recognition and detects hits on the drum tops. This is used to build a video transcription system whose output is combined with the audio transcription system to generate the final result. For the two stages in this approach: drum-top detection uses ellipse detection algorithms to locate the upper surfaces of the drums, and hit detection extracts silhouettes of the drummer and drumsticks to identify the actual hits. There is also another approach, which defines a 2D model with two weighting masks, standing respectively for the motion of the sticks and gestures, and for the motion of the instruments themselves. By calculating the intensity of these two kinds of motion, we can derive features. So that's all about the work that has been done recently. Future research could address several other aspects. First, there is a lack of a common database: current experiments have mostly been carried out on private datasets, which are small and maybe not that reliable, and that may influence the reported results. Second, and I find this important because of my own search for drum scores, existing surveys pay little attention to drum transcription for currently popular songs, which would actually be useful for drum learners and players. Third, looking at drum scores, there are textual annotations, like dynamics and expression markings, which can't be extracted for now, so this higher-level music information could be a further research area. And also on the video side, extracting features from different scenarios of performance videos: when the camera films from the top, we can get a good result, but maybe not from another side. It all depends on the cameras.
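The weighting-mask idea in the second approach can be sketched as frame differencing weighted by two region masks, giving one motion-intensity curve for the sticks/gestures and one for the instruments. The function below and its toy frames are an assumed minimal formulation, not the cited papers' exact model.

```python
import numpy as np

def motion_intensities(frames, stick_mask, drum_mask):
    """Per-frame motion intensity under two weighting masks (a sketch).

    frames     : (T, H, W) grayscale video frames
    stick_mask : (H, W) weights covering stick/gesture regions
    drum_mask  : (H, W) weights covering the instrument regions
    Returns two length-(T-1) arrays: the weighted mean absolute
    frame-to-frame difference under each mask.
    """
    diff = np.abs(np.diff(frames.astype(float), axis=0))  # (T-1, H, W)
    stick = (diff * stick_mask).mean(axis=(1, 2))
    drum = (diff * drum_mask).mean(axis=(1, 2))
    return stick, drum

# Toy example: a sudden change inside the "drum" region at frame 2
T, H, W = 4, 8, 8
frames = np.zeros((T, H, W))
frames[2:, 4:, :] = 1.0                 # lower half lights up and stays
drum_mask = np.zeros((H, W)); drum_mask[4:, :] = 1.0
stick_mask = 1.0 - drum_mask
stick, drum = motion_intensities(frames, stick_mask, drum_mask)
print(drum)    # peaks at index 1 (the frame 1 -> 2 transition)
```

Peaks in these two curves are the per-frame features from which hit candidates can be derived, analogous to the flux peaks on the audio side.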
If we don't have a good camera position, we may not get a good output, so we could pay more attention to feature extraction from different camera positions. We also need better fusion methods for video and audio, because the current method is quite simple: we use one modality and bring in the other as a fallback classifier, for example using video when the audio does not work. Maybe we could use them both together somehow. That's all. Thank you.
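The fusion the speaker asks for, using both modalities at once rather than one as a fallback, could in its simplest form be a late fusion of per-frame onset scores; the weights, threshold, and toy scores below are purely hypothetical.

```python
import numpy as np

def fuse_onsets(audio_score, video_score, w_audio=0.6, w_video=0.4, thresh=0.5):
    """Late fusion (a sketch): weighted sum of per-frame onset scores
    from the audio and video analyzers, then a single threshold,
    instead of switching to one modality only when the other fails."""
    combined = (w_audio * np.asarray(audio_score)
                + w_video * np.asarray(video_score))
    return np.where(combined > thresh)[0]

# Toy scores over 4 frames: audio is weak on the third hit, video catches it
audio = [0.9, 0.1, 0.3, 0.1]
video = [0.8, 0.0, 0.9, 0.1]
print(fuse_onsets(audio, video))   # [0 2]
```

Here frame 2 is reported only because both modalities contribute, which is exactly the behavior a simple fallback scheme cannot provide.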