ASR for Non-Written Languages



In this thesis, we study methods to discover words and extract their pronunciations from
audio data for non-written and under-resourced languages. We examine the potential and
the challenges of pronunciation extraction from phoneme sequences through cross-lingual
word-to-phoneme alignment which requires translations in a resource-rich source language.
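As a toy illustration of the core idea (a hypothetical sketch, not the Model 3P alignment model itself): once each target phoneme is linked to a source word, the alignment directly induces a segmentation of the phoneme sequence into word-like chunks. All names and the example data below are invented for illustration.

```python
def segment_by_alignment(source_words, phonemes, links):
    """Group target-language phonemes into word-like chunks using
    word-to-phoneme alignment links (one source-word index per
    phoneme, assumed monotonic). Returns (source word, chunk) pairs."""
    chunks = [[] for _ in source_words]
    for phoneme, word_idx in zip(phonemes, links):
        chunks[word_idx].append(phoneme)
    return list(zip(source_words, chunks))

# Toy cross-lingual example: Spanish source words aligned to
# phonemes of an (unwritten, here English-like) target utterance.
words = ["casa", "grande"]
phones = ["HH", "AW", "S", "B", "IH", "G"]
links = [0, 0, 0, 1, 1, 1]  # source-word index for each phoneme
print(segment_by_alignment(words, phones, links))
# -> [('casa', ['HH', 'AW', 'S']), ('grande', ['B', 'IH', 'G'])]
```

In practice the links come from a statistical alignment model and the phonemes from a noisy recognizer, which is why the extracted chunks need the error recovery methods described below.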
If only the target language audio is given, the translations can be produced by a human
translator. Otherwise, a human translator speaks the target language utterances from
prompts in the source language. We add the resource-rich source language prompts to help
the word discovery and pronunciation extraction process: By aligning the source language
words to the target language phonemes, we segment the phoneme sequences into word-
like chunks (word segmentation). The resulting chunks are interpreted as putative word
pronunciations but are very prone to alignment and phoneme recognition errors. Thus,
we propose our alignment model Model 3P (Stahlberg et al., 2012), which is specifically
designed for cross-lingual word-to-phoneme alignment. We present two different methods
(source word dependent and independent clustering) that extract word pronunciations from
word-to-phoneme alignments and compare them. For source word independent clustering,
we suggest an extension to the traditional k-means algorithm that addresses issues when
k-means is used to cluster word pronunciations. We show that all methods compensate for
phoneme recognition and alignment errors. We also extract a parallel corpus consisting
of 15 different translations in 10 languages from the Christian Bible to evaluate our
alignment model and error recovery methods. For example, based on noisy target language
phoneme sequences with a 45.1% error rate, we use a Spanish Bible translation to build a
pronunciation dictionary for an English Bible, achieving a 4.5% Out-Of-Vocabulary rate,
with 64% of the extracted pronunciations containing no more than one wrong phoneme. We
show that we can improve
results even more by combining multiple source languages. We present a novel method
for combining noisy word segmentations that leads to a relative gain of up to 11.2% in
F-score. Finally, we use the extracted pronunciations in an Automatic Speech Recognition
system for the target language and report promising word error rates, given that the
pronunciation dictionary and language model are learned completely unsupervised and no
written form of the target language is required for our approach. When multiple source
languages are combined, we can improve ASR accuracy by 9.1% relative compared to the
best system with only one source language, and by 50.1% compared to a monolingual word
segmentation method.
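The error compensation idea can be illustrated with a minimal sketch (a hypothetical simplification, not the k-means extension developed in the thesis): when the same word yields several noisy pronunciation chunks, selecting the variant with minimal total phoneme edit distance to all others lets the majority outvote recognition and alignment errors. All names and data below are invented for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance between two phoneme sequences."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

def medoid_pronunciation(variants):
    """Pick the variant with minimal total edit distance to all
    others: noisy outliers are outvoted by the consistent majority."""
    return min(variants, key=lambda v: sum(edit_distance(v, u) for u in variants))

variants = [
    ["HH", "AW", "S"],   # correct
    ["HH", "AW", "S"],   # correct
    ["AW", "S"],         # deletion error
    ["HH", "AW", "Z"],   # substitution error
]
print(medoid_pronunciation(variants))
# -> ['HH', 'AW', 'S']
```

A medoid over edit distance sidesteps the central difficulty that makes plain k-means awkward for this data: pronunciations are discrete symbol sequences, so there is no meaningful arithmetic mean to serve as a cluster centroid.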