TED Multilingual Discourse Bank (TED-MDB): a parallel corpus annotated in the PDTB style

Date

2020-06-01

Author

Zeyrek Bozşahin, Deniz
Grishina, Yulia
Kurfalı, Murathan
Gibbon, Samuel
Ogrodniczuk, Maciej

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

TED-Multilingual Discourse Bank, or TED-MDB, is a multilingual resource in which TED talks are annotated at the discourse level in 6 languages (English, Polish, German, Russian, European Portuguese, and Turkish) following the aims and principles of PDTB. We explain the corpus design criteria, which reflect three main features: the linguistic characteristics of the languages involved; the interactive nature of TED talks, which led us to annotate Hypophora; and the decision to avoid projection. We report our annotation consistency and post-annotation alignment experiments, and provide a cross-lingual comparison based on corpus statistics.

Subject Keywords

Discourse, Discourse relations, Corpus creation, Annotation, Multilingual corpus

URI

https://hdl.handle.net/11511/31690

Journal

Language Resources and Evaluation

DOI

https://doi.org/10.1007/s10579-019-09445-9

Collections

Graduate School of Informatics, Article

Suggestions

Multilingual Extension of PDTB-Style Annotation: The Case of TED Multilingual Discourse Bank Zeyrek Bozşahin, Deniz; Kurfalı, Murathan (2018-05-07) We introduce TED-Multilingual Discourse Bank, a corpus of TED talk transcripts in 6 languages (English, German, Polish, European Portuguese, Russian and Turkish), where the ultimate aim is to provide a clearly described level of discourse structure and semantics in multiple languages. The corpus is manually annotated following the goals and principles of PDTB, involving explicit and implicit discourse connectives, entity relations, alternative lexicalizations and no relations. In the corpus, we also aim to...
The Corpus of Turkish Youth Language (CoTY): The compilation and interactional dynamics of a spoken corpus Efeoğlu Özcan, Esranur; Işık Güler, Hale; English Language Teaching (2022-9-2) This study examines the previously unattained research area of contemporary spoken Turkish used in dyadic and multi-party interaction among young speakers of Turkish. For this purpose, a specialized corpus called the Corpus of Turkish Youth Language (CoTY) was compiled as a source of data and as a tool of analysis. Designed to offer a maximally representative sample of Turkish youth talk, the CoTY contains naturally occurring and spontaneous interactional data among young people between the ages of 14-18 fr...
Pair Annotation as a Novel Annotation Procedure: The Case of Turkish Discourse Bank Demirşahin, Işın; Zeyrek Bozşahin, Deniz (Springer, 2017-01-01) In this chapter, we provide an overview of Turkish Discourse Bank, a resource of ∼400,000 words built on a sub-corpus of the 2-million-word METU Turkish Corpus annotated following the principles of Penn Discourse Tree Bank. We first present the annotation framework we adopted, explaining how it differs from the annotation of the original language, English. Then we focus on a novel annotation procedure that we have devised and named pair annotation after pair programming. We discuss the advantages it has ...
Pair Annotation as a Novel Annotation Procedure: The Case of Turkish Discourse Bank Demirşahin, Işın; Zeyrek Bozşahin, Deniz (2017-6-17) In this chapter, we provide an overview of Turkish Discourse Bank, a resource of ∼400,000 words built on a sub-corpus of the 2-million-word METU Turkish Corpus annotated following the principles of Penn Discourse Tree Bank. We first present the annotation framework we adopted, explaining how it differs from the annotation of the original language, English. Then we focus on a novel annotation procedure that we have devised and named pair annotation after pair programming. We discuss the advantages it has of...
Linking discourse-level information and the induction of bilingual discourse connective lexicons Özer, Sibel; Kurfalı, Murathan; Zeyrek Bozşahin, Deniz; Mendes, Amália; Oleškevičiene, Giedre Valunaite (2022-6-20) The single biggest obstacle in performing comprehensive cross-lingual discourse analysis is the scarcity of multilingual resources. The existing resources are overwhelmingly monolingual, compelling researchers to infer the discourse-level information in the target languages through error-prone automatic means. The current paper aims to provide a more direct insight into the cross-lingual variations in discourse structures by linking the annotated relations of the TED-Multilingual Discourse Bank, which consi...

Citation Formats

D. Zeyrek Bozşahin, Y. Grishina, M. Kurfalı, S. Gibbon, and M. Ogrodniczuk, “TED Multilingual Discourse Bank (TED-MDB): a parallel corpus annotated in the PDTB style,” Language Resources and Evaluation, pp. 587–613, 2020. [Online]. Available: https://hdl.handle.net/11511/31690.