Database Commons

a catalog of worldwide biological databases

Database Profile

CRAFT

General information

URL: http://bionlp-corpora.sourceforge.net/CRAFT/index
Full name: Colorado Richly Annotated Full-Text (CRAFT) Corpus
Description: The Colorado Richly Annotated Full-Text (CRAFT) Corpus is a collection of 97 full-length, open-access biomedical journal articles that have been annotated both semantically and syntactically to serve as a research resource for the biomedical natural-language-processing (NLP) community.
Year founded: 2012
Last update:
Version:
Accessibility: Accessible
Country/Region: United States

Classification & Tag

Data type:
Data object: NA
Database category:
Major species: NA
Keywords:

Contact information

University/Institution: University of Colorado Anschutz Medical Campus
Address: Department of Pharmacology, University of Colorado Anschutz Medical Campus, Aurora, CO, USA
City: Aurora
Province/State:
Country/Region: United States
Contact name (PI/Team): Michael Bada
Contact email (PI/Helpdesk): mike.bada@ucdenver.edu

Publications

Concept annotation in the CRAFT corpus. [PMID: 22776079]
Michael Bada, Miriam Eckert, Donald Evans, Kristin Garcia, Krista Shipley, Dmitry Sitnikov, William A Baumgartner, K Bretonnel Cohen, Karin Verspoor, Judith A Blake, Lawrence E Hunter

BACKGROUND: Manually annotated corpora are critical for the training and evaluation of automated methods to identify concepts in biomedical text.
RESULTS: This paper presents the concept annotations of the Colorado Richly Annotated Full-Text (CRAFT) Corpus, a collection of 97 full-length, open-access biomedical journal articles that have been annotated both semantically and syntactically to serve as a research resource for the biomedical natural-language-processing (NLP) community. CRAFT identifies all mentions of nearly all concepts from nine prominent biomedical ontologies and terminologies: the Cell Type Ontology, the Chemical Entities of Biological Interest ontology, the NCBI Taxonomy, the Protein Ontology, the Sequence Ontology, the entries of the Entrez Gene database, and the three subontologies of the Gene Ontology. The first public release includes the annotations for 67 of the 97 articles, reserving two sets of 15 articles for future text-mining competitions (after which these too will be released). Concept annotations were created based on a single set of guidelines, which has enabled us to achieve consistently high interannotator agreement.
CONCLUSIONS: As the initial 67-article release contains more than 560,000 tokens (and the full set more than 790,000 tokens), our corpus is among the largest gold-standard annotated biomedical corpora. Unlike most others, the journal articles that comprise the corpus are drawn from diverse biomedical disciplines and are marked up in their entirety. Additionally, with a concept-annotation count of nearly 100,000 in the 67-article subset (and more than 140,000 in the full collection), the scale of conceptual markup is also among the largest of comparable corpora. The concept annotations of the CRAFT Corpus have the potential to significantly advance biomedical text mining by providing a high-quality gold standard for NLP systems. The corpus, annotation guidelines, and other associated resources are freely available at http://bionlp-corpora.sourceforge.net/CRAFT/index.shtml.

BMC Bioinformatics. 2012:13 | 114 Citations (from Europe PMC, 2025-12-13)

Ranking

All databases: 1592/6895 (76.925%)
Literature: 145/577 (75.043%)
Metadata: 153/719 (78.86%)
Total Rank: 1592
Citations: 112
z-index: 8.615
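
The percentages shown next to each rank are not explained on this page. As a minimal sketch, the three figures are consistent with the share of entries ranked at or below the database, computed as (total - rank + 1) / total. That formula is an inference from the numbers listed here, not a documented Database Commons method, and the function name `rank_percentile` is hypothetical.

```python
def rank_percentile(rank: int, total: int) -> float:
    """Percentage of entries ranked at or below `rank`.

    Assumed formula, inferred from the figures in this record:
    100 * (total - rank + 1) / total.
    """
    if not 1 <= rank <= total:
        raise ValueError("rank must be between 1 and total")
    return 100.0 * (total - rank + 1) / total


# Rank and category size as listed in the Ranking section of this record.
rankings = {
    "All databases": (1592, 6895),
    "Literature": (145, 577),
    "Metadata": (153, 719),
}

for label, (rank, total) in rankings.items():
    print(f"{label}: {rank}/{total} ({rank_percentile(rank, total):.3f}%)")
```

Under this assumption the computed values round to 76.925, 75.043, and 78.860, matching the listed percentages (the page drops the trailing zero in 78.86%).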

Community reviews

Not Rated

Record metadata

Created on: 2019-10-21
Curated by:
Amjad Ali [2019-11-13]
Ghulam Abbas [2019-10-21]