Diacritics correction in Turkish with context-aware sequence to sequence modeling

Date

2022-01-01

Author

Köksal, Asiye Tuba
Bozal, Özge
Özge, Umut

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

Digital texts in many languages contain missing or misused diacritics, which makes it hard for natural language processing applications to disambiguate word meanings. Diacritics restoration is therefore a crucial step in natural language processing applications for many languages. In this study, we approach the problem as a bidirectional transformation between diacritical letters and their ASCII counterparts, rather than as unidirectional diacritic restoration. We propose a context-aware, character-level sequence to sequence model for this transformation. The model is language independent in the sense that it requires no language-specific feature extraction other than word embeddings, and it is directly applicable to other languages. We trained the model for the Turkish diacritics correction task and evaluated it on a benchmark dataset of Turkish tweets. Our best setting for the proposed model improves on the state-of-the-art results in terms of F1 score by 4.7% on ambiguous words and by 1.24% over all cases.
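
To make the bidirectional framing concrete, the following minimal Python sketch shows the mapping between Turkish diacritical letters and their ASCII counterparts, and why context is needed to choose among restored forms. It is an illustration only, not the authors' implementation: the names `TR_TO_ASCII`, `asciify`, `candidates`, and `make_training_pair` are hypothetical, and the way training pairs are built here is an assumption about how character-level source/target sequences could be prepared.

```python
from itertools import product

# Turkish letters with diacritics and their ASCII counterparts.
# Note that dotless "ı" maps to "i" and dotted capital "İ" maps to "I".
TR_TO_ASCII = {
    "ç": "c", "ğ": "g", "ı": "i", "ö": "o", "ş": "s", "ü": "u",
    "Ç": "C", "Ğ": "G", "İ": "I", "Ö": "O", "Ş": "S", "Ü": "U",
}
# Reverse direction: each ambiguous ASCII letter and its diacritical variant.
ASCII_TO_TR = {ascii_ch: tr_ch for tr_ch, ascii_ch in TR_TO_ASCII.items()}


def asciify(text):
    """Replace every Turkish diacritical letter with its ASCII counterpart."""
    return "".join(TR_TO_ASCII.get(ch, ch) for ch in text)


def candidates(ascii_word):
    """Enumerate every spelling that asciifies to the given ASCII word."""
    options = [
        (ch, ASCII_TO_TR[ch]) if ch in ASCII_TO_TR else (ch,)
        for ch in ascii_word
    ]
    return ["".join(combo) for combo in product(*options)]


def make_training_pair(sentence):
    """Hypothetical helper: (noisy source, correct target) character sequences."""
    return list(asciify(sentence)), list(sentence)


if __name__ == "__main__":
    print(asciify("çağrı öğüt şiş"))
    # "acı" (pain) and "açı" (angle) collapse to the same ASCII form,
    # so only the surrounding context can pick the intended word.
    print(candidates("aci"))
```

The number of candidate spellings doubles with each ambiguous letter, which is one reason a learned, context-aware model is attractive compared with exhaustive enumeration.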

Subject Keywords

Natural language processing, diacritics restoration, diacritics correction, sequence to sequence learning, LSTM

URI

https://hdl.handle.net/11511/101812

Journal

Turkish Journal of Electrical Engineering and Computer Sciences

DOI

https://doi.org/10.55730/1300-0632.3948

Collections

Graduate School of Informatics, Article

Suggestions

Abstractive text summarization on WikiHow dataset using sentence embeddings
Tozyılmaz, Bahattin; Alemdar, Hande; Department of Computer Engineering (2019)
Summarization is a well known natural language processing task that is used in our day-to-day lives. The field saw recent research using neural networks and word embeddings. We use WikiHow dataset and show that we can match performance of a similar model using sentence embeddings, and using abstractive summarization. We show that we can use sentence embeddings and lower input data size without impacting performance too much.

Frequency-driven late fusion-based word decomposition approach on the phrase-based statistical machine translation systems
Tatlıcıoğlu, Mehmet; Yazıcı, Adnan; Department of Computer Engineering (2013)
Machine translation is the process of translating texts from a natural language to another by computers based on linguistic motivations, statistical approaches, or the combination of them. In this study, the frequency-driven late fusion-based word decomposition approach is introduced to improve the translation quality of the phrase-based statistical machine translation system from Turkish to English. This late fusion-based approach is compared with the standalone statistical and rule-based word decompositio...

Morphological processing of inflected and derived words in L1 Turkish and L2 English
Şafak, Duygu Fatma; Kırkıcı, Bilal; Department of English Language Teaching (2015)
The present study aims at examining how inflected and derived words are processed during the early stages of visual word recognition in a native language (L1) and in a second language (L2). A second aim of the study is to find out whether or not the semantic and surface-form properties of morphologically complex words affect early word recognition processes. Two masked priming experiments were conducted to investigate morphological processing in L1 Turkish and in L2 English. In the first experiment, 40 L1 s...

Single exponent in l1 multiple exponents in l2: consequences for l2
Kurumlu, Zehra; Gracanın Yüksek, Martına; Department of English Language Teaching (2013)
The thesis hypothesized that when an exponent of a linguistic concept in the native language maps onto several different exponents in the target language, learners have difficulty when acquiring those structures in the target language. By contrast, when an exponent of a linguistic concept in the native language and its counterpart in the target language stand in a one-to-one correspondence, the possibility of making errors decreases to a considerable extent. In order to test this hypothesis, I examined thre...

Head finalization and morphological analysis in factored phrase-based statistical machine translation from English to Turkish
İmren, Haydar; Çakıcı, Ruket; Department of Computer Engineering (2015)
Machine Translation is a field of study which deals with translating text from one natural language to another automatically. Statistical Machine Translation generates the translations using statistical methods and bilingual text corpora. In this study, an approach for translating from English to Turkish is introduced. Turkish is an agglutinative language with a free constituent order, whereas English is not agglutinative and the constituent order is strict. Besides these differences, there is a lack of par...

Citation Formats

A. T. Köksal, Ö. Bozal, and U. Özge, “Diacritics correction in Turkish with context-aware sequence to sequence modeling,” Turkish Journal of Electrical Engineering and Computer Sciences, vol. 30, no. 6, pp. 2433–2445, 2022, Accessed: 2023. [Online]. Available: https://hdl.handle.net/11511/101812.