A multi-level multi-label text classification dataset of 19th century Ottoman and Russian literary and critical texts
Date
2024-01-01
Author
Gokceoglu, Gokcen
Cavusoglu, Devrim
Akbaş, Emre
Dolcerocca, Ozen Nergis
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
This paper introduces a multi-level, multi-label text classification dataset comprising over 3000 documents. The dataset features literary and critical texts in 19th-century Ottoman Turkish and Russian, sourced from prominent literary periodicals of the era, and this is the first study to apply large language models (LLMs) to it. The texts have been meticulously organized and labeled according to a taxonomic framework that takes into account both their structural and semantic attributes. Articles are categorized and tagged with bibliometric metadata by human experts. We present baseline classification results using a classical bag-of-words (BoW) naive Bayes model and three modern LLMs: multilingual BERT, Falcon, and Llama-v2. We found that in certain cases BoW outperforms the LLMs, emphasizing the need for additional research, especially in low-resource language settings. This dataset is expected to be a valuable resource for researchers in natural language processing and machine learning, especially for historical and low-resource languages. The dataset is publicly available.
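To make the bag-of-words naive Bayes baseline mentioned in the abstract more concrete, the following is a minimal sketch of multi-label classification with one binary naive Bayes model per label (a "binary relevance" setup). It is an illustration under stated assumptions, not the authors' implementation: the record does not describe how the baseline was built, the toy documents and the label names `literary` and `critical` are invented here, and the sketch covers only a single flat level of labels rather than the multi-level taxonomy.

```python
import math
from collections import Counter


def tokenize(text):
    # Simplistic whitespace tokenizer; a real pipeline would need
    # language-specific preprocessing (assumption for illustration).
    return text.lower().split()


class BinaryNB:
    """Multinomial naive Bayes for one yes/no label, with add-one smoothing."""

    def fit(self, docs, targets):
        self.counts = {True: Counter(), False: Counter()}
        self.n_docs = {True: 0, False: 0}
        for doc, target in zip(docs, targets):
            self.counts[target].update(tokenize(doc))
            self.n_docs[target] += 1
        self.vocab = set(self.counts[True]) | set(self.counts[False])
        return self

    def log_score(self, doc, target):
        total = sum(self.counts[target].values())
        # Smoothed log prior of the class.
        score = math.log((self.n_docs[target] + 1) / (sum(self.n_docs.values()) + 2))
        for tok in tokenize(doc):
            if tok in self.vocab:  # ignore words never seen in training
                score += math.log(
                    (self.counts[target][tok] + 1) / (total + len(self.vocab))
                )
        return score

    def predict(self, doc):
        return self.log_score(doc, True) > self.log_score(doc, False)


def fit_multilabel(docs, label_sets):
    # Binary relevance: train one independent yes/no model per label.
    labels = sorted(set().union(*label_sets))
    return {
        lab: BinaryNB().fit(docs, [lab in s for s in label_sets]) for lab in labels
    }


def predict_multilabel(models, doc):
    # A document receives every label whose model votes "yes".
    return {lab for lab, model in models.items() if model.predict(doc)}


# Hypothetical toy data, not drawn from the dataset.
docs = [
    "poem spring love",
    "poem war love",
    "review novel critic",
    "review poem critic love",
]
label_sets = [{"literary"}, {"literary"}, {"critical"}, {"critical", "literary"}]

models = fit_multilabel(docs, label_sets)
print(predict_multilabel(models, "poem love"))
print(predict_multilabel(models, "review poem love critic"))
```

Because each label gets its own model, a document can receive zero, one, or several labels, which is the property that distinguishes multi-label from ordinary multi-class classification.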
URI
https://www.scopus.com/inward/record.uri?partnerID=HzOxMe3b&scp=85205304519&origin=inward
https://hdl.handle.net/11511/112148
Conference Name
Findings of the 62nd Annual Meeting of the Association for Computational Linguistics, ACL 2024
Collections
Department of Computer Engineering, Conference / Seminar
Citation Formats
IEEE
G. Gokceoglu, D. Cavusoglu, E. Akbaş, and O. N. Dolcerocca, “A multi-level multi-label text classification dataset of 19th century Ottoman and Russian literary and critical texts,” presented at the Findings of the 62nd Annual Meeting of the Association for Computational Linguistics, ACL 2024, Hybrid, Bangkok, Thailand, 2024, Accessed: 00, 2024. [Online]. Available: https://www.scopus.com/inward/record.uri?partnerID=HzOxMe3b&scp=85205304519&origin=inward.