Speaker 1: Hi, my name is Kin Wai Cheuk from the Singapore University of Technology and Design. Today, I am going to present a paper called ReconVAT, a semi-supervised automatic music transcription framework for low-resource real-world data. This work is in collaboration with my supervisors Dorien Herremans and Li Su from Academia Sinica.

Before we dive into the details, let's have a high-level summary of what this work is about. ReconVAT is a semi-supervised framework for automatic music transcription, or simply AMT. This framework outperforms other state-of-the-art supervised models and generalizes better in low-resource settings. The transcription accuracy can be improved using unlabeled data within our framework.

Here is the table of contents of this talk. The talk is divided into different subsections. You may skip to the section that you are interested in, or you can follow my presentation from the beginning to the end.

First, let's talk about the background. What is AMT? AMT is the process of transforming audio data into symbolic representations such as music scores. It is analogous to automatic speech recognition, in which we transform audio data into text. Why do we choose to study AMT instead of other tasks? Because AMT is a fundamental music information retrieval task with various important downstream applications. Extracting information directly from raw audio is difficult; most of the time, we want an intermediate representation of the audio. In the case of music audio, this symbolic representation should contain the pitches and rhythms of the music. Once we have this symbolic representation, we can use it for music indexing, music generation, music recommendation, and music analysis.

Now, let's talk about the problem definition for AMT. Extracting information directly from the time domain is difficult. Therefore, we define our problem as spectrogram-to-piano-roll conversion, which is very similar to the task of image segmentation. The only difference is that the frequency dimension of the spectrogram is different from the pitch dimension of the piano roll. Once we have the piano roll, it can easily be converted into other representations such as music scores.

Next, we will talk about the motivation for applying semi-supervised learning to AMT. Currently, fully supervised models can achieve high transcription accuracy on piano music because we have several sufficiently large annotated piano music datasets. But in the case of strings and woodwinds music, we do not have enough labeled data, and therefore fully supervised models fail to achieve good transcription accuracy. Existing unsupervised music transcription models only work for a specific instrument type, for example drums or pianos. Our proposed semi-supervised framework aims at achieving high transcription accuracy for various instrument types with limited labels.

This is the schematic diagram of our ReconVAT semi-supervised framework. It consists of three branches, from left to right: the reconstruction branch, the original branch, and the virtual adversarial training (VAT) branch. First, let's start with the middle branch. We have our original spectrogram here, and we feed it to the transcription model to obtain a posteriorgram. Then, we use this posteriorgram as the input to a generator and produce a reconstructed spectrogram. The reconstructed spectrogram is fed to the same transcription model to obtain our second posteriorgram. Here, we have three different loss terms: the mean square error between the original spectrogram and the reconstructed spectrogram, the binary cross entropy between the ground truth label and the first posteriorgram, and another binary cross entropy between the ground truth label and the second posteriorgram.
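To make the middle branch concrete, here is a minimal PyTorch-style sketch of the reconstruction branch and its three loss terms. The `transcriber` and `generator` modules are hypothetical stand-ins for the networks described in this talk, and the sketch assumes the transcriber outputs sigmoid probabilities; it is an illustration, not the actual implementation.

```python
import torch.nn.functional as F

def reconstruction_branch_losses(transcriber, generator, spec, label=None):
    """Sketch of the middle branch: spectrogram -> posteriorgram ->
    reconstructed spectrogram -> second posteriorgram."""
    post_1 = transcriber(spec)          # first posteriorgram (values in [0, 1])
    recon_spec = generator(post_1)      # reconstructed spectrogram
    post_2 = transcriber(recon_spec)    # second posteriorgram

    # Loss 1: mean square error between original and reconstructed spectrograms.
    l_recon = F.mse_loss(recon_spec, spec)

    # Losses 2 and 3: binary cross entropy against the ground-truth piano roll,
    # available only when a label exists.
    l_trans = None
    if label is not None:
        l_trans = (F.binary_cross_entropy(post_1, label)
                   + F.binary_cross_entropy(post_2, label))
    return l_recon, l_trans
```

Note that only the two transcription terms require a ground-truth piano roll; the reconstruction term can be computed from the audio alone.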
Now, let's focus on the right-hand side of our framework. I will explain more about VAT later, but for now, let's accept the fact that we need the spectrogram and the posteriorgram to calculate the adversarial vector. Once we have the adversarial vector, we add it to the original spectrogram and obtain an adversarial spectrogram. We feed this adversarial spectrogram to the transcription model again and obtain our third posteriorgram. The local distributional smoothness, or LDS, between the first and third posteriorgrams should be minimized. More information about LDS will be introduced later.

We modify our previous transcription model and use it in the current framework. The first modification is support for multi-channel output. In this paper, we only explore two cases: the one-channel case with only frame features, and the two-channel case with frame features plus onset prediction. We also replace all the LSTM layers with self-attention layers to summarize all the channels and output the final posteriorgram.

Now that we have a high-level understanding of the semi-supervised framework, it is time to dive into the details of adversarial training. Adversarial training was first proposed by Goodfellow et al. in 2015. In that paper, they discovered that, even with a well-trained classifier, adding a small amount of noise to the input can greatly affect the classification result, even though the noise does not change the visual appearance of the input at all. They therefore proposed adversarial training to improve the robustness of the model. The basic idea is that we find the adversarial vector that could break our model, add this adversarial vector to the input, and train our model with the resulting adversarial example. The problem is that, in order to find the adversarial vector, we need labels.

To extend the concept of adversarial training to unlabeled data, Miyato et al. proposed a method called virtual adversarial training (VAT). Under this framework, we can obtain the adversarial vector without using any label. We just need the model prediction on the original input and the model prediction on an input with random noise. Then, we calculate the loss between the two predictions. Finally, the adversarial vector is the gradient of this loss with respect to the noise r. The intuition behind this method is that we first move our sample x by a random vector r. Then, we measure the degree of change in label space with a loss function D. After that, we find the direction of r that causes the greatest change in D by differentiating D with respect to r. We use this direction as our adversarial vector and use epsilon to control its magnitude.

In case some of the audience are more interested in the mathematical explanation of how to obtain the adversarial vector, I am going to spend a few minutes going through its derivation. For those who are satisfied with the intuitive explanation, you can skip these two slides and jump to the objective function slide.
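To complement the intuition above, below is a minimal PyTorch-style sketch of how the virtual adversarial vector and the corresponding smoothness penalty can be computed. It follows the general recipe of Miyato et al. with a single power-iteration step; the hyperparameters `xi` and `eps`, and the use of binary cross entropy as the divergence, are illustrative assumptions rather than the exact settings of this paper.

```python
import torch
import torch.nn.functional as F

def virtual_adversarial_lds(transcriber, spec, xi=1e-6, eps=1.0):
    """Find the perturbation that changes the prediction the most,
    then penalize that change (local distributional smoothness)."""
    with torch.no_grad():
        post_ref = transcriber(spec)                 # prediction on the clean input

    # Small random perturbation r, normalized per sample and scaled by xi.
    d = torch.randn_like(spec)
    d = xi * F.normalize(d.flatten(1), dim=1).view_as(spec)
    d.requires_grad_(True)

    # Measure how much the prediction changes, then differentiate with respect
    # to the noise: the gradient points in the most sensitive direction.
    dist = F.binary_cross_entropy(transcriber(spec + d), post_ref)
    grad = torch.autograd.grad(dist, d)[0]
    r_adv = eps * F.normalize(grad.flatten(1), dim=1).view_as(spec)

    # LDS: the prediction on the adversarial spectrogram should stay close
    # to the prediction on the clean spectrogram.
    lds = F.binary_cross_entropy(transcriber(spec + r_adv), post_ref)
    return lds, r_adv
```

Notice that no ground-truth label appears anywhere in this computation, which is what makes the LDS term usable on unlabeled data.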
The derivation starts with the loss between the prediction obtained from the clean input and the prediction obtained from the noisy input. When we try to find the adversarial vector, the model weights theta are frozen. For simplicity, in the following discussion, we will denote this loss as D(r).

Now, let's apply a Taylor expansion on D at r = 0, keeping only the first three terms and discarding the rest. The first term, D(0), is 0, because the prediction obtained from the original input should be the same as the prediction obtained from the same input without adding any noise. The gradient of D at r = 0 should also be 0, because we expect the loss function to attain its minimum of 0 there, which is true for most loss functions. After cancelling these two terms, only one term is left. Since taking the gradient twice is the same as taking the Hessian, we will use the Hessian symbol H to represent the last term, which becomes the quadratic form (1/2) rᵀH r.

Remember, we are trying to find the direction of r that results in the largest value of D. If we constrain r to have unit norm, this becomes a constrained quadratic-form maximization problem. Luckily, the solution of such a problem is well known: D attains its maximum when r is the dominant eigenvector of the Hessian. The search for the adversarial vector has therefore become a search for an eigenvector. One way to find the dominant eigenvector is the power iteration algorithm. For this algorithm to work, we need a symmetric matrix; we multiply this matrix by a randomly initialized vector d for k times, and eventually d converges to the dominant eigenvector of H. Since the Hessian matrix is symmetric, we can simply use this algorithm. One last problem is that calculating the Hessian is computationally very expensive. In the original VAT paper, a finite-difference trick is used to approximate the Hessian-vector product. After simplification, we get the same expression for the adversarial vector as before.

Since we have multiple components in our ReconVAT framework, we have multiple objectives to be minimized. The first objective is the local distributional smoothness, or simply LDS. It is the loss between the prediction obtained from the clean sample and the prediction obtained from the adversarial sample. We want to minimize LDS because we do not want the prediction to change when there is a small amount of noise in the input. In our case, we use the binary cross entropy as the loss function. The input sample here can be either labeled or unlabeled, which means that we can apply LDS on both labeled and unlabeled data. Next, we have the spectrogram reconstruction loss, which is simply the mean square error between the original spectrogram and the reconstructed spectrogram. Last but not least, we have the transcription loss. It consists of two terms: the binary cross entropy between the ground-truth piano roll and the posteriorgram, and the binary cross entropy between the ground-truth piano roll and the posteriorgram obtained from the reconstructed spectrogram. Each term of the transcription loss can be applied to the onset prediction or the frame posteriorgram prediction. In the case of the one-channel output unit, we only apply the transcription loss to the frame posteriorgrams. The final objective to be minimized combines these terms, with alpha as a scaling factor for the LDS term; alpha is set to 1 throughout the experiments.

Before I start introducing the experiments, I would like to explain the evaluation metrics. In the following experiments, I am going to use three different metrics: the frame metric, the note metric, and the note-with-offset metric. The frame metric is a pixel-by-pixel evaluation on the piano roll; it treats piano rolls as images. The note metric extracts note objects from the piano rolls. If two note objects have the same pitch and onset location, the prediction is considered correct. The note-with-offset metric is very similar to the note metric, but it also considers the offset locations of the objects.

Let's consider the following example. Given a ground truth and two predictions, we want to know which metric reflects the true transcription accuracy. If we look at the frame-wise F1 score, the score for prediction 1 is higher than for prediction 2, because prediction 1 has more correct pixels. But prediction 1 does not make any musical sense compared to the ground truth. The ground truth contains 6 note objects in total, whereas prediction 1 contains 35 note objects; that is why the note-wise F1 score for prediction 1 is so low. Prediction 2 has exactly 6 note objects, just like the ground truth, and these 6 note objects have the same pitches and onset locations as the ground truth; that is why the note-wise F1 score for prediction 2 is 100%. However, all the note objects in prediction 2 are too short: none of them has the same offset location as the ground truth, which is why the note-with-offset F1 score is 0. In this example, we can see that the note metric and the note-with-offset metric reflect the transcription accuracy better than the frame metric.
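For readers who want to compute these note-level metrics themselves, here is a hedged sketch using the mir_eval library, which implements this kind of evaluation. The toy note lists mimic the prediction 2 case above: correct pitches and onsets, but notes that end far too early. The tolerance values shown are mir_eval's conventional defaults, not settings quoted from this paper.

```python
import numpy as np
import mir_eval.transcription

# Each note is an (onset, offset) interval in seconds plus a pitch in Hz.
ref_intervals = np.array([[0.0, 1.0], [1.0, 2.0]])
ref_pitches = np.array([440.0, 523.25])
est_intervals = np.array([[0.0, 0.3], [1.0, 1.2]])   # right onsets, notes too short
est_pitches = np.array([440.0, 523.25])

# Note metric: pitch and onset must match; offsets are ignored (offset_ratio=None).
_, _, f_note, _ = mir_eval.transcription.precision_recall_f1_overlap(
    ref_intervals, ref_pitches, est_intervals, est_pitches,
    onset_tolerance=0.05, offset_ratio=None)

# Note-with-offset metric: offsets must also match within a tolerance.
_, _, f_note_offset, _ = mir_eval.transcription.precision_recall_f1_overlap(
    ref_intervals, ref_pitches, est_intervals, est_pitches,
    onset_tolerance=0.05, offset_ratio=0.2)

print(f_note, f_note_offset)   # 1.0 for the note metric, 0.0 once offsets count
```

The frame metric, in contrast, scores each time-pitch cell of the piano roll independently, which is why it can reward musically meaningless predictions.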
Now that we understand how our evaluation works, let's go through the experiments. Our first experiment is on piano datasets. We use MAPS as our labeled dataset and MAESTRO as our unlabeled dataset. We downsample all the recordings to 16 kHz and extract Mel spectrograms from them, with a window length of 2048 samples and 229 Mel frequency bins.

We have three different versions of MAPS for this experiment. The first version is the full MAPS dataset, in which we remove the pieces that overlap between the training set and the test set, leaving 139 pieces as the labeled training set. We use the training set of MAESTRO as our unlabeled training set, which consists of 967 pieces. The test set is the standard 60 pieces of MAPS. In the small version of MAPS, we use only 23 pieces as the labeled training set; the unlabeled training set and the test set stay the same as before. Finally, we also have a one-shot version of MAPS, in which we use only one piece as the labeled training set.

Here are the results of our experiments. As we can see from the table, if we have enough labeled data, our ReconVAT is as good as the baseline Onsets and Frames model. As we reduce the size of the labeled training data, our ReconVAT starts to outperform the baseline model. In the one-shot setting, our ReconVAT is much better than the baseline model.

Our second experiment is on a string dataset. We extract 8 string pieces from MusicNet as our labeled training set, 104 pieces as our unlabeled training set, and 4 pieces from MusicNet as our test set. This is the result for the string experiment. Interestingly, using onsets does not help to improve the transcription accuracy on this dataset, and therefore one of the baselines, the Onsets and Frames model, does not perform well here. Because of this, we use the one-channel output unit, and we compare against another baseline model that achieves state-of-the-art performance on this dataset. Nonetheless, our ReconVAT is still better than both of the baseline models.
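As a side note on reproducing the preprocessing described above (16 kHz audio, Mel spectrograms with a 2048-sample window and 229 Mel bins), here is a minimal sketch using torchaudio. torchaudio is used purely for illustration, and the hop length and log compression are my own assumptions, since they are not stated in the talk.

```python
import torch
import torchaudio

mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=16_000,   # recordings are downsampled to 16 kHz
    n_fft=2048,           # 2048-sample window, as stated in the talk
    hop_length=512,       # assumed value, not given in the talk
    n_mels=229,           # 229 Mel frequency bins
)

waveform, sr = torchaudio.load("example.wav")                    # placeholder path
waveform = torchaudio.functional.resample(waveform, sr, 16_000)
spec = torch.log(mel(waveform) + 1e-6)                           # (channels, 229, frames)
```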
There are two possible reasons why onsets do not help on this dataset. The first reason is that violin onsets are not as obvious as piano onsets, as shown in the figure on the right-hand side. The second reason is that the labels for this dataset are not completely accurate, as pointed out by other researchers.

Finally, we also test our model on a woodwind dataset. Again, we extract 6 pieces from MusicNet as our labeled training data and another 21 pieces as our unlabeled training data. The experimental results show that our ReconVAT is still better than the baseline model. Since onset information does not work for this dataset either, we ignore the results produced by the models that use onset information.

To conclude, our ReconVAT outperforms the baseline models in the low-resource setting. Our proposed framework works not only on the piano dataset, but also on the strings and woodwinds datasets. We believe that our ReconVAT would also be useful for other kinds of musical instruments.

Now, let's listen to one transcription example produced by the baseline model and by our ReconVAT. This is the ground truth audio. This is the transcription result from the baseline model. This is the transcription result from our ReconVAT. You can access more demonstration pieces from our demo page; the link to the demo page is given by the QR code here.

We will conclude our presentation with this figure, which shows how our ReconVAT compares to other baseline models in terms of the number of parameters and transcription accuracy. As you can see from the figure, using VAT can boost the model performance without adding extra parameters. If we combine the idea of VAT with spectrogram reconstruction, we obtain our ReconVAT framework. The source code of this paper is available by scanning the QR code on the left-hand side. The demo page is also available by scanning the QR code in the middle. Finally, nnAudio, the library that we use to extract spectrograms, is also available by scanning the QR code on the right-hand side. This is the end of my presentation. Thank you for listening. If you have any other questions, feel free to email me at this email address.