Large vocabulary continuous speech recognition for Turkish Using HTK

Çömez, Murat Ali
This study aims to build a new language model that can be used in a Turkish large vocabulary continuous speech recognition system. Turkish is a very productive language in terms of word forms because of its agglutinative nature. For such languages like Turkish, the vocabulary size is far from being acceptable. From only one simple stem, thousands of new word forms can be generated using inflectional or derivational suffixes. In this thesis, words are parsed into their stems and endings. One ending includes the suffixes attached to the associated root. Then the search network based on bigrams is constructed. Bigrams are obtained either using stem and endings, or using only stems. The language model proposed is based on bigrams obtained using only stems. All work is done in HTK (Hidden Markov Model Toolkit) environment, except parsing and network transforming. Besides of offering a new language model for Turkish, this study involves a comprehensive work about speech recognition inspecting into concepts in the state of the art speech recognition systems. To acquire good command of these concepts and processes in speech recognition isolated word, connected word and continuous speech recognition tasks are performed. The experimental results associated with these tasks are also given.
Citation Formats
M. A. Çömez, “Large vocabulary continuous speech recognition for Turkish Using HTK,” M.S. - Master of Science, Middle East Technical University, 2003.