TED Multilingual Discourse Bank (TED-MDB): a parallel corpus annotated in the PDTB style

2020-06-01
Zeyrek Bozşahin, Deniz
Grishina, Yulia
Kurfalı, Murathan
Gibbon, Samuel
Ogrodniczuk, Maciej
TED-Multilingual Discourse Bank, or TED-MDB, is a multilingual resource in which TED talks are annotated at the discourse level in 6 languages (English, Polish, German, Russian, European Portuguese, and Turkish) following the aims and principles of PDTB. We explain the corpus design criteria, which have three main features: the linguistic characteristics of the languages involved, the interactive nature of TED talks (which led us to annotate Hypophora), and the decision to avoid projection. We report our annotation consistency and post-annotation alignment experiments, and provide a cross-lingual comparison based on corpus statistics.
Language Resources and Evaluation

Suggestions

Multilingual Extension of PDTB-Style Annotation: The Case of TED Multilingual Discourse Bank
Zeyrek Bozşahin, Deniz; Kurfalı, Murathan (2018-05-07)
We introduce TED-Multilingual Discourse Bank, a corpus of TED talks transcripts in 6 languages (English, German, Polish, European Portuguese, Russian and Turkish), where the ultimate aim is to provide a clearly described level of discourse structure and semantics in multiple languages. The corpus is manually annotated following the goals and principles of PDTB, involving explicit and implicit discourse connectives, entity relations, alternative lexicalizations and no relations. In the corpus, we also aim to...
The Corpus of Turkish Youth Language (CoTY): The compilation and interactional dynamics of a spoken corpus
Efeoğlu Özcan, Esranur; Işık Güler, Hale; English Language Teaching (2022-09-02)
This study examines the previously unexplored research area of contemporary spoken Turkish used in dyadic and multi-party interaction among young speakers of Turkish. For this purpose, a specialized corpus called the Corpus of Turkish Youth Language (CoTY) was compiled as a source of data and as a tool of analysis. Designed to offer a maximally representative sample of Turkish youth talk, the CoTY contains naturally occurring and spontaneous interactional data among young people between the ages of 14-18 fr...
Pair Annotation as a Novel Annotation Procedure: The Case of Turkish Discourse Bank
Demirşahin, Işın; Zeyrek Bozşahin, Deniz (Springer, 2017-01-01)
In this chapter, we provide an overview of Turkish Discourse Bank, a resource of ∼400,000 words built on a sub-corpus of the 2-million-word METU Turkish Corpus annotated following the principles of Penn Discourse Tree Bank. We first present the annotation framework we adopted, explaining how it differs from the annotation of the original language, English. Then we focus on a novel annotation procedure that we have devised and named pair annotation after pair programming. We discuss the advantages it has ...
Pair Annotation as a Novel Annotation Procedure: The Case of Turkish Discourse Bank
Demirşahin, Işın; Zeyrek Bozşahin, Deniz (2017-06-17)
In this chapter, we provide an overview of Turkish Discourse Bank, a resource of ∼400,000 words built on a sub-corpus of the 2-million-word METU Turkish Corpus annotated following the principles of Penn Discourse Tree Bank. We first present the annotation framework we adopted, explaining how it differs from the annotation of the original language, English. Then we focus on a novel annotation procedure that we have devised and named pair annotation after pair programming. We discuss the advantages it has of...
Linking discourse-level information and the induction of bilingual discourse connective lexicons
Özer, Sibel; Kurfalı, Murathan; Zeyrek Bozşahin, Deniz; Mendes, Amália; Oleškevičiene, Giedre Valunaite (2022-06-20)
The single biggest obstacle in performing comprehensive cross-lingual discourse analysis is the scarcity of multilingual resources. The existing resources are overwhelmingly monolingual, compelling researchers to infer the discourse-level information in the target languages through error-prone automatic means. The current paper aims to provide a more direct insight into the cross-lingual variations in discourse structures by linking the annotated relations of the TED-Multilingual Discourse Bank, which consi...
Citation Formats
D. Zeyrek Bozşahin, Y. Grishina, M. Kurfalı, S. Gibbon, and M. Ogrodniczuk, “TED Multilingual Discourse Bank (TED-MDB): a parallel corpus annotated in the PDTB style,” Language Resources and Evaluation, pp. 587–613, 2020. [Online]. Available: https://hdl.handle.net/11511/31690.