Swiss German Speech-to-Text: Multi-Dialect Framework
An exploration of multi-dialect Swiss German speech recognition, comparing systems trained on dialectal versus normalized transcriptions.

Speaker 1: Hello, my name is Iulia, and together with Dan we'll present the first multi-dialect speech-to-text framework for Swiss German dialects. We'll show the results of a few systems we have trained, with a particular focus on the difference between training on two different types of transcriptions.

To begin, let me briefly review the typical automatic speech recognition pipeline. The task of automatic speech recognition is to map an acoustic signal, usually split into frames, to a sequence of words. The pipeline consists of a data preparation stage, including feature extraction; the training of three probabilistic models; and a decoding stage, where the three models are combined. The three models represent different aspects of language: the acoustic model, the pronunciation model, and the language model. The acoustic model maps the acoustic signal, represented as feature vectors, to phonemes. The pronunciation model, or dictionary, finds the best correspondence between a sequence of phonemes and graphemes, forming a pronunciation lexicon, which is usually built by professional linguists or trained separately if some training data is available. Finally, the language model generates the most probable word sequences in the language, using the output of the acoustic and pronunciation models.

The writing system for the output transcriptions is one of the challenges for Swiss German automatic speech recognition, since Swiss German is characterized by strong regional variation on all linguistic levels and by the lack of a standard orthography. For our experiments, we use data from the ArchiMob corpus of Swiss German, which provides two types of transcriptions. First, the dialectal transcriptions, which are based on the spelling principles proposed by Eugen Dieth and are intended to convey the true sound of spoken Swiss German using spelling conventions from standard German. The second type is a semi-automatically normalized version of the dialectal transcriptions, where all surface spelling variants of the same word are mapped to a single normalized form. Such normalization cannot be seen as translation, since the sentence structure always stays unmodified and only the spelling of words is changed. The ArchiMob corpus consists of interview recordings with native speakers of 14 different Swiss German dialects, totaling approximately 70 hours of raw speech data. The data has been manually transcribed by native-speaker annotators.

The features we used are 13-dimensional mel-frequency cepstral coefficient (MFCC) features with first and second derivatives, extracted from the speech signal. For our systems, we use an adapted Wall Street Journal recipe provided in the Kaldi toolkit. Since no multi-dialectal automatic speech recognition baseline exists for Swiss German, as a starting point for our experiments we decided to use a simple Kaldi setup as a baseline, where the acoustic model is a discriminatively trained deep neural network. We trained one acoustic model on the dialectal transcriptions and another on the normalized transcriptions, and combined these with the pronunciation lexicons and the baseline language models trained on the training data. For the main models, the time-delay neural network architecture from the Wall Street Journal chain recipe was adapted.
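As a point of reference, the decoding stage just described is usually formalized as a search for the most probable word sequence given the acoustics. In LaTeX, with X the sequence of feature vectors and W a candidate word sequence:

    \hat{W} = \arg\max_{W} \; P(X \mid W)\, P(W)

Here P(X | W) is supplied by the acoustic and pronunciation models, and P(W) by the language model.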
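To make the feature setup concrete, here is a minimal Python sketch of 13-dimensional MFCCs with first and second derivatives, using librosa rather than Kaldi's own feature extractor; the file name and the 16 kHz sampling rate are assumptions, as the talk does not state them:

    import librosa
    import numpy as np

    # Load one recording (file name and sample rate are assumed here).
    y, sr = librosa.load("interview.wav", sr=16000)

    # 13 MFCCs per frame, as described in the talk.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

    # First and second derivatives, stacked into 39-dimensional frames.
    delta1 = librosa.feature.delta(mfcc)
    delta2 = librosa.feature.delta(mfcc, order=2)
    features = np.vstack([mfcc, delta1, delta2])  # shape: (39, n_frames)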
To increase the amount of training data and improve robustness, we performed audio speed perturbation followed by volume perturbation, which increased the amount of data by a factor of three. In addition to the features used in the baseline model, we also included 100-dimensional i-vectors extracted from each speech frame in order to normalize the variation between speakers and dialectal varieties. Again, the models were trained on both types of transcriptions. And now, Taylor will continue.
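A minimal numpy sketch of the three-way augmentation described above. The speed factors 0.9/1.0/1.1 follow the common Kaldi convention, and the random gain range is an assumption rather than the speakers' exact setting:

    import numpy as np

    def speed_perturb(wav, factor):
        # Resample the waveform onto a stretched or compressed time axis,
        # changing tempo and pitch together (like `sox speed`).
        n_out = int(round(len(wav) / factor))
        return np.interp(np.linspace(0, len(wav) - 1, n_out),
                         np.arange(len(wav)), wav)

    def volume_perturb(wav, rng, low=0.5, high=1.5):
        # Apply a random global gain to the waveform.
        return wav * rng.uniform(low, high)

    rng = np.random.default_rng(0)
    wav = rng.standard_normal(16000)  # stand-in for one second of audio
    augmented = [volume_perturb(speed_perturb(wav, f), rng)
                 for f in (0.9, 1.0, 1.1)]  # triples the training data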

Speaker 2: Thanks, Iulia. For our experiments, we require pronunciation and language models tailored specifically to each particular writing of Swiss German. The pronunciation lexicon for the dialectal writing is obtained using simple grapheme-to-phoneme heuristics. Here we rely on the fact that the dialectal writing provides a loosely phonemic representation, and we derive a pronunciation string for each word by simply segmenting its constituent graphemes according to certain rules. For example, the word Schwiederkeit is annotated as shown here, with frequently occurring grapheme sequences such as "sch" and diphthong vowels mapped to single phoneme symbols. For the normalized writing, such a rule-based approach is not suitable, since the normalized spellings do not closely represent the spoken realization. Recently, however, Schmidt et al. (2019) released an 11,000-word pronunciation dictionary that maps standard German words to their Swiss German pronunciations in six regional varieties, using SAMPA annotations. We used this lexicon to train a transformer-based grapheme-to-phoneme model on the available dictionary pairs and applied it to the words for which a manual SAMPA annotation is missing. Using these two methods, we achieved almost 100% lexical coverage of the training data and around 80% coverage on the held-out development and test sets.

All of our speech-to-text systems make use of n-gram language models. We investigated potential improvements from increasing the n-gram order and trying various smoothing techniques, and found that trigram models typically performed best for the dialectal transcriptions, while 5-gram models provided slight improvements on the normalized writing. For smoothing, we found that interpolated modified Kneser-Ney consistently yielded the best results in terms of test set perplexity. We trained our language models primarily on the ArchiMob training data, but also found that a small amount of out-of-domain data from additional Swiss German and standard German corpora helped to lower test set perplexity slightly. As a result, our experiments also looked at the improvement gained by swapping the baseline language models for ones trained on these additional out-of-domain utterances.

Typically, speech-to-text systems are evaluated using word error rate, which counts the edit operations required to transform a system output utterance into the corresponding ground-truth utterance. This standard metric doesn't permit spelling variation, and thus is not truly indicative for a speech-to-text system that relies on non-normalized or non-standardized writings, where such variation is common. To account for this, we use a flexible adaptation of the word error rate metric that exploits the normalized annotations in the ArchiMob corpus and allows a certain degree of spelling variation in output transcriptions. For example, a valid reference-hypothesis pair such as this one would be penalized with an error rate of 100% under the standard word error rate metric; but since these variants are permissible according to the normalized annotations in the corpus, the adapted metric doesn't penalize the output transcription at all.
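To illustrate the kind of grapheme segmentation heuristic described for the dialectal lexicon, here is a hypothetical Python sketch; the rule table is invented for illustration and is not the speakers' actual rule set:

    # Hypothetical rule table for a Dieth-style writing; multi-grapheme
    # units are matched longest-first.
    RULES = {
        "sch": "S",   # frequent grapheme cluster -> one phoneme symbol
        "ch": "x",
        "ei": "ei",   # diphthongs kept as single symbols
        "ie": "ie",
        "ii": "i:",   # doubled vowels mark length in Dieth writing
    }

    def dialectal_g2p(word):
        """Greedy longest-match segmentation into phoneme symbols."""
        phones, i = [], 0
        lengths = sorted({len(k) for k in RULES}, reverse=True)
        while i < len(word):
            for n in lengths:
                if word[i:i + n] in RULES:
                    phones.append(RULES[word[i:i + n]])
                    i += n
                    break
            else:
                phones.append(word[i])  # unknown grapheme maps to itself
                i += 1
        return phones

    print(dialectal_g2p("schwiizer"))  # ['S', 'w', 'i:', 'z', 'e', 'r']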
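The language-model setup can be sketched as follows. The speakers presumably used a standard toolkit such as SRILM or KenLM; this NLTK version uses interpolated Kneser-Ney (a close relative of the interpolated modified Kneser-Ney mentioned in the talk) on toy data:

    import nltk
    from nltk.lm import KneserNeyInterpolated
    from nltk.lm.preprocessing import padded_everygram_pipeline, pad_both_ends

    # Toy tokenized sentences standing in for the ArchiMob transcripts.
    train = [["si", "isch", "deet", "gsi"],
             ["er", "isch", "au", "deet", "gsi"]]

    order = 3  # trigrams worked best for the dialectal writing in the talk
    ngrams, vocab = padded_everygram_pipeline(order, train)
    lm = KneserNeyInterpolated(order)
    lm.fit(ngrams, vocab)

    # Perplexity of a held-out sentence under the trigram model.
    test = list(nltk.ngrams(pad_both_ends(["si", "isch", "gsi"], n=order), order))
    print(lm.perplexity(test))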
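Finally, one simple way to implement the flexible metric just described (not necessarily the speakers' exact implementation) is a token-level Levenshtein distance with a pluggable equality test, so that surface variants sharing a normalized form count as matches. The normalization table below is invented for illustration:

    import numpy as np

    def wer(ref, hyp, same=lambda a, b: a == b):
        """Word error rate via token-level edit distance, with a
        pluggable equality test so spelling variants can be accepted."""
        d = np.zeros((len(ref) + 1, len(hyp) + 1), dtype=int)
        d[:, 0] = np.arange(len(ref) + 1)
        d[0, :] = np.arange(len(hyp) + 1)
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if same(ref[i - 1], hyp[j - 1]) else 1
                d[i, j] = min(d[i - 1, j] + 1,         # deletion
                              d[i, j - 1] + 1,         # insertion
                              d[i - 1, j - 1] + cost)  # substitution
        return d[len(ref), len(hyp)] / len(ref)

    # Hypothetical mapping of surface variants to a normalized form.
    NORM = {"gsi": "gewesen", "gsii": "gewesen", "gsy": "gewesen"}
    flexible = lambda a, b: NORM.get(a, a) == NORM.get(b, b)

    ref, hyp = ["si", "isch", "gsi"], ["si", "isch", "gsy"]
    print(wer(ref, hyp))            # 0.333... under the standard metric
    print(wer(ref, hyp, flexible))  # 0.0, the variant is accepted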
We evaluated the performance of three systems for each writing type. The top figure shows the overall performance for the dialectal writing, with the baseline model achieving a word error rate of 54.39%. Swapping the basic acoustic model for a time-delay neural network provides a significant improvement, reducing the error rate to 42.38%. We also found that substituting the baseline language model with one that uses a small amount of out-of-domain training data brought this down a little further, to 42.16%. If we consider the flexible word error rate metric, shown here in orange, overall error rates are considerably reduced for all systems, typically by around 21 percentage points. For the systems trained on the normalized writing, shown in the bottom graph, we see a similar trend: the overall improvement comes from the time-delay neural network acoustic model, and then again a very slight improvement comes from incorporating some additional out-of-domain data in the n-gram language model.

To sum up, we've demonstrated the feasibility of Swiss German speech-to-text with two potential writings of Swiss German, using very limited training data and time-delay neural network models with i-vectors for normalizing inter-speaker variability. Our results show that systems trained on the normalized writing outperform those trained on the dialectal writing when considering only the standard word error rate metric. However, it is also clear that a flexible word error rate metric is indeed required to provide a more accurate picture of recognition errors in Swiss German speech-to-text that uses dialectal writings. Thank you very much for your attention, and we'll now open the floor to questions.
