Diacritics correction in Turkish with context-aware sequence to sequence modeling
Date
2022-01-01
Author
Köksal, Asiye Tuba
Bozal, Özge
Özge, Umut
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Digital texts in many languages contain missing or misused diacritics, which makes it hard for natural language processing applications to disambiguate the meaning of words. Diacritics restoration is therefore a crucial step in natural language processing applications for many languages. In this study, we approach the problem as a bidirectional transformation between diacritical letters and their ASCII counterparts, rather than as unidirectional diacritic restoration. We propose a context-aware, character-level sequence to sequence model for this transformation. The model is language independent in the sense that no language-specific feature extraction is necessary beyond the use of word embeddings, so it is directly applicable to other languages. We trained the model for the Turkish diacritics correction task and evaluated it on the Turkish tweets benchmark dataset. Our best setting for the proposed model improves the state-of-the-art results in terms of F1 score by 4.7% on ambiguous words and 1.24% over all cases.
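To make the letter-level transformation concrete, the following is a minimal Python sketch of the mapping between Turkish diacritical letters and their ASCII counterparts. It is not the authors' model: the names `TO_ASCII`, `VARIANTS`, `asciify`, and `candidates` are hypothetical, and the sketch assumes the input uses precomposed (NFC) characters. It only illustrates why context is needed: a single ASCII form can correspond to several valid Turkish words.

```python
from itertools import product

# Turkish letters with diacritics (plus dotless i / dotted capital I)
# and their ASCII counterparts. Illustrative table, not taken from the paper.
TO_ASCII = {
    "ç": "c", "ğ": "g", "ı": "i", "ö": "o", "ş": "s", "ü": "u",
    "Ç": "C", "Ğ": "G", "İ": "I", "Ö": "O", "Ş": "S", "Ü": "U",
}

# Reverse direction: each ASCII letter maps to every form it may stand for,
# with the plain ASCII letter listed first.
VARIANTS = {}
for diacritic, plain in TO_ASCII.items():
    VARIANTS.setdefault(plain, [plain]).append(diacritic)


def asciify(text):
    """Replace every diacritical letter with its ASCII counterpart."""
    return "".join(TO_ASCII.get(ch, ch) for ch in text)


def candidates(word):
    """Enumerate every form obtainable by toggling diacritics in `word`."""
    options = [VARIANTS.get(ch, [ch]) for ch in asciify(word)]
    return ["".join(chars) for chars in product(*options)]


if __name__ == "__main__":
    print(asciify("açı"))      # aci
    print(candidates("aci"))   # ['aci', 'acı', 'açi', 'açı']
```

Among the candidates for "aci", both "acı" (pain, bitter) and "açı" (angle) are valid Turkish words, so a lookup table alone cannot choose between them. A context-aware model such as the one proposed in the abstract is meant to resolve exactly this kind of ambiguity.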
Subject Keywords
Natural language processing, diacritics restoration, diacritics correction, sequence to sequence learning, LSTM, RESTORATION
URI
https://hdl.handle.net/11511/101812
Journal
Turkish Journal of Electrical Engineering and Computer Sciences
DOI
https://doi.org/10.55730/1300-0632.3948
Collections
Graduate School of Informatics, Article
Suggestions
Abstractive text summarization on WikiHow dataset using sentence embeddings
Tozyılmaz, Bahattin; Alemdar, Hande; Department of Computer Engineering (2019)
Summarization is a well-known natural language processing task that is used in our day-to-day lives. The field has seen recent research using neural networks and word embeddings. We use the WikiHow dataset and show that we can match the performance of a similar model using sentence embeddings and abstractive summarization. We also show that we can use sentence embeddings and a smaller input data size without impacting performance too much.
Frequency-driven late fusion-based word decomposition approach on the phrase-based statistical machine translation systems
Tatlıcıoğlu, Mehmet; Yazıcı, Adnan; Department of Computer Engineering (2013)
Machine translation is the process of translating texts from a natural language to another by computers based on linguistic motivations, statistical approaches, or the combination of them. In this study, the frequency-driven late fusion-based word decomposition approach is introduced to improve the translation quality of the phrase-based statistical machine translation system from Turkish to English. This late fusion-based approach is compared with the standalone statistical and rule-based word decompositio...
Morphological processing of inflected and derived words in L1 Turkish and L2 English
Şafak, Duygu Fatma; Kırkıcı, Bilal; Department of English Language Teaching (2015)
The present study aims at examining how inflected and derived words are processed during the early stages of visual word recognition in a native language (L1) and in a second language (L2). A second aim of the study is to find out whether or not the semantic and surface-form properties of morphologically complex words affect early word recognition processes. Two masked priming experiments were conducted to investigate morphological processing in L1 Turkish and in L2 English. In the first experiment, 40 L1 s...
Single exponent in l1 multiple exponents in l2: consequences for l2
Kurumlu, Zehra; Gracanın Yüksek, Martına; Department of English Language Teaching (2013)
The thesis hypothesized that when an exponent of a linguistic concept in the native language maps onto several different exponents in the target language, learners have difficulty when acquiring those structures in the target language. By contrast, when an exponent of a linguistic concept in the native language and its counterpart in the target language stand in a one-to-one correspondence, the possibility of making errors decreases to a considerable extent. In order to test this hypothesis, I examined thre...
Head finalization and morphological analysis in factored phrase-based statistical machine translation from English to Turkish
İmren, Haydar; Çakıcı, Ruket; Department of Computer Engineering (2015)
Machine Translation is a field of study which deals with translating text from one natural language to another automatically. Statistical Machine Translation generates the translations using statistical methods and bilingual text corpora. In this study, an approach for translating from English to Turkish is introduced. Turkish is an agglutinative language with a free constituent order, whereas English is not agglutinative and the constituent order is strict. Besides these differences, there is a lack of par...
Citation Formats
A. T. Köksal, Ö. Bozal, and U. Özge, “Diacritics correction in Turkish with context-aware sequence to sequence modeling,” Turkish Journal of Electrical Engineering and Computer Sciences, vol. 30, no. 6, pp. 2433–2445, 2022. [Online]. Available: https://hdl.handle.net/11511/101812.