Interactive Digital Violin Tutor: Enhancing Learning with Audio-Visual Fusion
Explore the IDVT system, which combines audio and visual inputs for accurate violin transcription, offering effective, self-paced learning for beginners.
File: iDVT Demo Automatic Audio-Visual Violin Transcription - 12 (added 09/06/2024)

Speaker 1: This is the IDVT demonstration: automatic violin transcription using audio-visual fusion. The authors are Lu Huanhuan, Zhang Bingjun, Wang Ye, and Liao Weiqing from the National University of Singapore. There are four parts in this video demo: the motivation and background; the technical part, which is the audio-visual violin transcription; the IDVT setup in the home environment; and lastly, a showcase of the IDVT in use.

IDVT stands for Interactive Digital Violin Tutor. The motivation for developing the system follows the pedagogical foundation of David Perkins. To quote from his work, "People learn much of what they have a reasonable opportunity and motivation to learn." There are four essential aspects of effective learning: clear information, thoughtful practice, informative feedback, and strong intrinsic and extrinsic motivations. Keeping this pedagogical foundation in mind, we intend to develop the IDVT to make violin learning effective and fun. It provides self-paced learning that is accessible anytime, anywhere, and it offers a constructive learning environment for beginning violin learners.

The critical part of this system is the violin transcription, which converts the audio or visual input of the violin music into MIDI notation. This notation is the feedback for beginning violin learners, helping them identify the errors in their playing so they can improve their technique. As we know, violin music is pitched, non-percussive music. Most of the time the violin produces monophonic sounds, and only rarely polyphonic sounds such as double stops or triple stops. Therefore, the pitch estimation part is largely monophonic pitch estimation, which is considered a solved problem. However, onset detection for violin music is much more difficult. Because of the poor onset detection performance, audio-only violin transcription does not perform very well. That is our motivation for including the visual part to help the violin transcription and provide more accurate feedback. In violin playing, bow stroke reversals and vertical bow movements are associated with note onsets, and the trajectories of the fingers are also associated with note onsets. Therefore, we can include the visual part to help the audio-only transcription. In addition, useful visual learning feedback, such as the playing gestures and the fingering and bowing trajectories, can be fed back to the users as a reference during the learning process.

Next, let's take a look at the technical part of the IDVT, which is the audio-visual violin transcription. Here is the system diagram for this demo. In the audio processing part, the system takes the audio input, extracts MFCC features, and then uses a GMM to derive an audio-only onset detection function. At the same time, the video part is processed: from the bowing video and the fingering video, the system extracts the bowing detection function and the fingering detection function, respectively. Next, in the data fusion part, we use SVM-based fusion to combine the audio and visual information and produce a more effective audio-visual detection function, from which the onset times are picked. At this step, the onset detection is finished. Once we know the onsets, the whole violin piece is divided into a series of notes. After that, we apply pitch estimation to produce the final MIDI notation.
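The demo describes this pipeline only at the block-diagram level, so the following is a minimal Python sketch of how the fusion, onset-picking, and per-note pitch-estimation stages could be wired together; it is not the authors' implementation. It assumes the per-frame audio and visual detection functions are already available as NumPy arrays, and it substitutes scikit-learn's SVC and librosa's YIN estimator for whatever components the original system used. All function names and parameters are illustrative.

```python
import numpy as np
import librosa
from scipy.signal import find_peaks
from sklearn.svm import SVC

def train_fusion(audio_df, bow_df, finger_df, frame_labels):
    """Fit an SVM on stacked per-frame detection values (labels: 1 = onset frame)."""
    X = np.column_stack([audio_df, bow_df, finger_df])
    return SVC(probability=True).fit(X, frame_labels)

def fused_detection(clf, audio_df, bow_df, finger_df):
    """Audio-visual detection function: SVM posterior probability of 'onset' per frame."""
    X = np.column_stack([audio_df, bow_df, finger_df])
    return clf.predict_proba(X)[:, 1]

def pick_onsets(detection_fn, hop_s, min_gap_s=0.08, threshold=0.5):
    """Peak-pick the fused detection function to obtain onset times in seconds."""
    peaks, _ = find_peaks(detection_fn, height=threshold,
                          distance=max(1, int(min_gap_s / hop_s)))
    return peaks * hop_s

def segments_to_midi(y, sr, onset_times):
    """Estimate one pitch per inter-onset segment and convert it to a MIDI note number."""
    notes = []
    bounds = list(onset_times) + [len(y) / sr]
    for start, end in zip(bounds[:-1], bounds[1:]):
        seg = y[int(start * sr):int(end * sr)]
        if len(seg) < 2048:  # too short for a stable pitch estimate
            continue
        f0 = librosa.yin(seg, fmin=librosa.note_to_hz("G3"),
                         fmax=librosa.note_to_hz("E7"), sr=sr)
        notes.append(int(round(librosa.hz_to_midi(float(np.median(f0))))))
    return notes
```

In this sketch the SVM is treated as a per-frame binary classifier whose posterior probability serves as the fused detection function; in practice it would be trained on labelled recordings and then applied to new performances.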
The MIDI notation is then shown to the users so they can check whether their own playing is correct or wrong.

Next, let's take a look at the audio processing part. Here we propose a new algorithm that uses GMMs (Gaussian mixture models) to classify MFCC (mel-frequency cepstral coefficient) features. This method outperformed the state-of-the-art methods listed in references 3, 12, 13, and 22 by 10% F-measure in less noisy conditions, tested on a pre-recorded database. In the figure on the right, we can see that onset and offset frames can be distinguished in the MFCC features; therefore, we can use a GMM to classify the onset times effectively.

After the audio processing, we have the video processing part. Following the motivation introduced so far, we explore the correlations between the visual features of violin playing and the acoustic violin music. In the fingering video, we can detect the four strings and the finger positions on the strings with the algorithm we have developed, and the tracking can be done in real time. After the tracking, we obtain the finger press and release moments. Using these visual features, we construct a fingering onset detection function whose peaks indicate the onset times, as shown in the figure on the right. Once we have this onset detection function, we can use it in the later audio-visual fusion step to produce more accurate onsets. To emphasize, the tracking algorithm runs in real time. If the video is not playing smoothly here, you can view it online by following this link.

The bowing analysis follows the same philosophy as the fingering analysis. We use a video processing algorithm to track the hand motion in order to obtain the bow reversal moments. According to violin playing technique, the bow reversal moments very likely correspond to onset times. We use these visual features to construct a bowing onset detection function whose peaks indicate onset times. The hand tracking can also be achieved in real time.

After the audio and video processing, we use SVM-based fusion to combine the audio and visual onset detection functions and produce more accurate onsets. Tested on the database, the SVM-based fusion improves onset detection by 5-18% F-measure compared with the audio-only approach under different noise conditions, and the transcription accuracy is therefore improved by 14-20%. The next figure shows the improvement trend: the x-axis shows four databases with different noise levels. As the audio becomes noisier, the improvement brought by the visual information becomes larger for both onset detection and the overall transcription. Therefore, we can see that the visual part helps a great deal to provide more accurate transcription results, which is critical for the Interactive Digital Violin Tutor.
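For the audio processing step, the demo only states that MFCC features are classified with GMMs; the exact feature and model configuration is not given. The sketch below is one plausible way to build such an audio-only detection function, assuming hand-labelled onset and non-onset frame indices are available for training; the number of mixture components, MFCC dimensionality, and hop length are placeholder values, not the authors' settings.

```python
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def audio_onset_detection_function(y, sr, onset_idx, nononset_idx,
                                   n_mfcc=13, hop_length=512, n_components=4):
    """Audio-only onset detection function from frame-level MFCCs.

    Two GMMs are fitted, one on labelled onset frames and one on non-onset
    frames; the per-frame log-likelihood ratio serves as the detection
    function (larger values = more onset-like)."""
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                hop_length=hop_length).T        # shape (n_frames, n_mfcc)
    gmm_onset = GaussianMixture(n_components=n_components).fit(mfcc[onset_idx])
    gmm_other = GaussianMixture(n_components=n_components).fit(mfcc[nononset_idx])
    return gmm_onset.score_samples(mfcc) - gmm_other.score_samples(mfcc)
```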
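Similarly, the bowing analysis is described only as tracking the hand motion and taking bow reversal moments as onset candidates. Here is a minimal sketch of that idea, assuming the bowing hand has already been tracked to a per-frame x-coordinate; the smoothing width and the sign-change criterion are assumptions, not the authors' algorithm, and the fingering detection function built from finger press and release moments would follow the same pattern.

```python
import numpy as np

def bowing_detection_function(hand_x, fps, smooth_frames=5):
    """Bowing onset detection function from a tracked hand trajectory.

    A bow-stroke reversal is taken to be a sign change of the smoothed
    horizontal velocity; the detection function is 1 at reversal frames
    and 0 elsewhere, so its peaks mark candidate note onsets."""
    x = np.convolve(hand_x, np.ones(smooth_frames) / smooth_frames, mode="same")
    v = np.gradient(x) * fps                       # horizontal velocity in pixels/second
    reversals = np.where(np.sign(v[:-1]) * np.sign(v[1:]) < 0)[0] + 1
    df = np.zeros(len(hand_x))
    df[reversals] = 1.0
    return df
```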
