Discovering Vocabulary through Cross-Lingual Alignment


The basic approach

In this thesis, we explore the use of cross-lingual information to discover the vocabulary of an unseen target language. Starting from the output of a phoneme recognizer, we learn the alignment between the target-language phoneme sequence and the source-language word sequence. We use this alignment to group the target-language phonemes into words, and from these words we extract the vocabulary of the target language. Our approach requires only a phoneme recognizer for a related source language, written sentences in the source language, and their spoken translations in the target language. Our proposed methods compensate for alignment errors and phoneme recognition errors. For evaluation purposes, we collected a small new corpus, the Basic Medical Expression Database, consisting of 200 parallel sentences in English, German, Croatian and Slovene, together with 2.5 hours of speech data in Croatian and Slovene. The resulting approach is highly relevant to Machine Translation (MT) and Automatic Speech Recognition (ASR) systems, particularly for under-resourced languages and for languages that have no written form.
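The grouping step described above can be illustrated with a minimal sketch. It assumes the alignment is already available as one source-word index per target phoneme; how that alignment is learned, and how errors are compensated for, is not shown. The function names `segment_by_alignment` and `extract_vocabulary` and the toy data are hypothetical and are not taken from the thesis.

```python
from collections import Counter
from itertools import groupby


def segment_by_alignment(phonemes, alignment):
    """Group consecutive phonemes aligned to the same source-word index.

    phonemes  -- list of phoneme symbols for one target-language utterance
    alignment -- list of the same length; alignment[i] is the index of the
                 source-language word that phoneme i is aligned to
                 (assumed input format, for illustration only)
    """
    if len(phonemes) != len(alignment):
        raise ValueError("one alignment entry per phoneme is required")
    words = []
    # groupby merges only adjacent items with the same key, so a source
    # word index that reappears later starts a new target word.
    for _, group in groupby(zip(phonemes, alignment), key=lambda pair: pair[1]):
        words.append(" ".join(phoneme for phoneme, _ in group))
    return words


def extract_vocabulary(corpus):
    """Count every phoneme-string word found across all utterances."""
    vocabulary = Counter()
    for phonemes, alignment in corpus:
        vocabulary.update(segment_by_alignment(phonemes, alignment))
    return vocabulary


# Toy data invented for this sketch; not real recognizer output.
corpus = [
    (["d", "o", "b", "a", "r", "d", "a", "n"], [0, 0, 0, 0, 0, 1, 1, 1]),
    (["d", "a", "n"], [0, 0, 0]),
]
vocabulary = extract_vocabulary(corpus)
```

Here `vocabulary` maps each discovered phoneme string to the number of times it was segmented out of the corpus. Counting occurrences is one simple design choice: a word candidate that recurs across utterances is less likely to be the product of a single alignment or recognition error, although the thesis's own error-compensation methods may work differently.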