The CHEMDNER corpus of chemicals and drugs and its annotation principles

Krallinger, Martin
Rabal, Obdulia
Leitner, Florian
Vazquez, Miguel
Salgado, David
Lu, Zhiyong
Leaman, Robert
Lu, Yanan
Ji, Donghong
Lowe, Daniel M.
Sayle, Roger A.
Batista-Navarro, Riza Theresa
Rak, Rafal
Huber, Torsten
Rocktaschel, Tim
Matos, Sergio
Campos, David
Tang, Buzhou
Xu, Hua
Munkhdalai, Tsendsuren
Ryu, Keun Ho
Ramanan, S. V.
Nathan, Senthil
Zitnik, Slavko
Bajec, Marko
Weber, Lutz
Irmer, Matthias
Akhondi, Saber A.
Kors, Jan A.
Xu, Shuo
An, Xin
Sikdar, Utpal Kumar
Ekbal, Asif
Yoshioka, Masaharu
Dieb, Thaer M.
Choi, Miji
Verspoor, Karin
Khabsa, Madian
Giles, C. Lee
Liu, Hongfang
Ravikumar, Komandur Elayavilli
Lamurias, Andre
Couto, Francisco M.
Dai, Hong-Jie
Tsai, Richard Tzong-Han
Ata, Caglar
Can, Tolga
Usie, Anabel
Alves, Rui
Segura-Bedmar, Isabel
Martinez, Paloma
Oyarzabal, Julen
Valencia, Alfonso
The automatic extraction of chemical information from text requires the recognition of chemical entity mentions as one of its key steps. When developing supervised named entity recognition (NER) systems, the availability of a large, manually annotated text corpus is desirable. Furthermore, large corpora permit the robust evaluation and comparison of different approaches that detect chemicals in documents. We present the CHEMDNER corpus, a collection of 10,000 PubMed abstracts that contain a total of 84,355 chemical entity mentions labeled manually by expert chemistry literature curators, following annotation guidelines specifically defined for this task. The abstracts of the CHEMDNER corpus were selected to be representative of all major chemical disciplines. Each of the chemical entity mentions was manually labeled according to its structure-associated chemical entity mention (SACEM) class: abbreviation, family, formula, identifier, multiple, systematic and trivial. The difficulty and consistency of tagging chemicals in text were measured using an agreement study between annotators, obtaining a percentage agreement of 91%. For a subset of the CHEMDNER corpus (the test set of 3,000 abstracts) we provide not only the Gold Standard manual annotations, but also mentions automatically detected by the 26 teams that participated in the BioCreative IV CHEMDNER chemical mention recognition task. In addition, we release the CHEMDNER silver standard corpus of automatically extracted mentions from 17,000 randomly selected PubMed abstracts. A version of the CHEMDNER corpus in the BioC format has been generated as well. We propose a standard for required minimum information about entity annotations for the construction of domain-specific corpora on chemical and drug entities. The CHEMDNER corpus and annotation guidelines are available at:
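As an illustration of the agreement figure mentioned in the abstract, the following is a minimal Python sketch of one way a percentage agreement between two annotators could be computed over mention spans. The abstract does not state the exact formula used in the study, so the definition here (mentions marked identically by both annotators, divided by all distinct mentions marked by either) is an assumption, and the `Mention` type, the `percentage_agreement` function, and the sample offsets are hypothetical names and values chosen only for this example.

```python
from typing import Iterable, NamedTuple


class Mention(NamedTuple):
    """Hypothetical record for one annotated chemical mention."""
    doc_id: str       # abstract identifier
    start: int        # character offset where the mention begins
    end: int          # character offset where the mention ends
    sacem_class: str  # e.g. "trivial", "formula", "systematic"


def percentage_agreement(a: Iterable[Mention],
                         b: Iterable[Mention],
                         match_class: bool = False) -> float:
    """Percentage of distinct mentions that both annotators marked identically.

    Assumed definition: |A ∩ B| / |A ∪ B| * 100. With match_class=False only
    the document and span must match; with match_class=True the SACEM class
    must match as well.
    """
    key = (lambda m: tuple(m)) if match_class else (lambda m: tuple(m[:3]))
    set_a = {key(m) for m in a}
    set_b = {key(m) for m in b}
    union = set_a | set_b
    if not union:
        return 100.0  # nothing annotated by either annotator
    return 100.0 * len(set_a & set_b) / len(union)


# Made-up annotations from two annotators over the same two abstracts.
annotator_1 = [
    Mention("doc1", 0, 7, "trivial"),
    Mention("doc1", 20, 24, "formula"),
    Mention("doc2", 5, 12, "systematic"),
]
annotator_2 = [
    Mention("doc1", 0, 7, "trivial"),
    Mention("doc1", 20, 24, "abbreviation"),  # same span, different class
    Mention("doc2", 5, 14, "systematic"),     # different span boundary
]

span_only = percentage_agreement(annotator_1, annotator_2)
span_and_class = percentage_agreement(annotator_1, annotator_2, match_class=True)
```

Comparing the span-only score with the span-and-class score separates disagreement about mention boundaries from disagreement about the SACEM class assigned to a mention.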


Named Entity Recognition in Turkish with Bayesian Learning and Hybrid Approaches
Yavuz, Sermet Reha; Kucuk, Dilek; Yazıcı, Adnan (2013-10-29)
Named entity recognition is one of the significant textual information extraction tasks. In this paper, we present two approaches for named entity recognition on Turkish texts. The first is a Bayesian learning approach which is trained on a considerably limited training set. The second approach comprises two hybrid systems based on joint utilization of this Bayesian learning approach and a previously proposed rule-based named entity recognizer. All of the proposed three approaches achieve promising performa...
A hybrid named entity recognizer for Turkish
Kucuk, Dilek; Yazıcı, Adnan (2012-02-15)
Named entity recognition is an important subfield of the broader research area of information extraction from textual data. Yet, named entity recognition research conducted on Turkish texts is still rare as compared to related research carried out on other languages such as English, Spanish, Chinese, and Japanese. In this study, we present a hybrid named entity recognizer for Turkish, which is based on a manually engineered rule based recognizer that we have proposed. Since rule based systems for specific d...
Financial named entity recognition for Turkish news texts
Dinç, Duygu; Doğru, Ali Hikmet; Karagöz, Pınar; Department of Computer Engineering (2022-7-26)
Named Entity Recognition (NER) is an information extraction problem in which the objective is to detect the named entities (NEs) in a given text and label them correctly according to predetermined categories. An NE may be a noun or a group of nouns that corresponds to the name of a specific object, a location or, in domain-specific applications, a concept. In the literature, person, organization and location names and date, time, money and percentage expressions are among the most studied generic NEs. Besides, there are ...
Named entity recognition experiments on Turkish texts
Küçük, Dilek; Yazıcı, Adnan (2009-10-28)
Named entity recognition (NER) is one of the main information extraction tasks and research on NER from Turkish texts is known to be rare. In this study, we present a rule-based NER system for Turkish which employs a set of lexical resources and pattern bases for the extraction of named entities including the names of people, locations, organizations together with time/date and money/percentage expressions. The domain of the system is news texts and it does not utilize important clues of capitalization and ...
Novel supramolecular ion sensing systems and their application in molecular logic gates
Coşkun, Ali; Akkaya, Engin Umut; Department of Chemistry (2003)
Recognition and sensing of ions is an important front in supramolecular organic chemistry. One remarkable extension of this kind of work is the application of selective switching processes to logic gate operations. In this study, we have designed selective metal ion chelators for zinc and cadmium ions based on dansylamide fluorophores and dipicolylamine chelators. The zinc complex of a previously reported difluoroboradiazaindacene-bipyridyl derivative was shown to respond to anions by an increase in emission i...
Citation Formats
M. Krallinger et al., “The CHEMDNER corpus of chemicals and drugs and its annotation principles,” JOURNAL OF CHEMINFORMATICS, pp. 0–0, 2015, Accessed: 00, 2020. [Online]. Available: