BioBERT and Similar Approaches for Relation Extraction.

Balu Bhasuran
Author Information
  1. Balu Bhasuran: DRDO-BU Center for Life Sciences, Bharathiar University Campus, Coimbatore, Tamilnadu, India. balubhasuran08@gmail.com.

Abstract

In biomedicine, facts about relations between entities (diseases, genes, drugs, etc.) are hidden in a trove of more than 30 million scientific publications. Curated relational information has proven to play an important role in applications such as drug repurposing and precision medicine. Recent advances in deep learning produced a transformer architecture named BERT (Bidirectional Encoder Representations from Transformers). This language model, pretrained on the BooksCorpus (800M words) and English Wikipedia (2,500M words), reported state-of-the-art results on various NLP (Natural Language Processing) tasks, including relation extraction. It is widely accepted that, because of the shift in word distribution, general-domain models perform poorly on biomedical information extraction tasks. The architecture was therefore adapted to the biomedical domain by pretraining the language model on roughly 28 million scientific articles from PubMed and PubMed Central. This chapter presents a protocol for relation extraction using BERT, discussing state-of-the-art biomedical BERT variants such as BioBERT. The protocol covers the general BERT architecture, pretraining and fine-tuning, leveraging biomedical information, and finally the infusion of knowledge graphs into the BERT model layers.
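To make the fine-tuning step of the protocol concrete, the following is a minimal sketch of sentence-level relation classification with a pretrained BioBERT checkpoint. It assumes the Hugging Face transformers library and PyTorch are available and uses the publicly released dmis-lab/biobert-base-cased-v1.1 model; the @GENE$ and @DISEASE$ placeholders follow the entity-anonymization convention used in BioBERT's relation-extraction experiments, and the two-label mapping shown here is purely illustrative.

# Minimal sketch: relation classification with BioBERT (assumed checkpoint name).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "dmis-lab/biobert-base-cased-v1.1"  # assumed public BioBERT checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
# Illustrative label mapping: 0 = no relation, 1 = gene-disease association.
# The classification head is randomly initialized here; in practice it must be
# fine-tuned on a labeled relation-extraction corpus before the scores are meaningful.
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)
model.eval()

# Candidate entity mentions are anonymized with placeholder tags before classification,
# following the BioBERT relation-extraction setup.
sentence = "Mutations in @GENE$ have been associated with an increased risk of @DISEASE$."

inputs = tokenizer(sentence, return_tensors="pt", truncation=True, max_length=128)
with torch.no_grad():
    logits = model(**inputs).logits
probs = torch.softmax(logits, dim=-1).squeeze()
print(f"P(no relation) = {probs[0]:.3f}, P(association) = {probs[1]:.3f}")

In practice the same model and tokenizer would first be fine-tuned on a task-specific corpus (for example a gene-disease association dataset) using a standard sequence-classification training loop, after which the forward pass above yields usable relation scores.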

Keywords


MeSH Terms

Information Storage and Retrieval
Language
Natural Language Processing
PubMed
Publications
