BioBERT and Similar Approaches for Relation Extraction.

Balu Bhasuran
Author Information
  1. Balu Bhasuran: DRDO-BU Center for Life Sciences, Bharathiar University Campus, Coimbatore, Tamilnadu, India. balubhasuran08@gmail.com.

Abstract

In biomedicine, facts about relations between entities (diseases, genes, drugs, etc.) are hidden in a trove of more than 30 million scientific publications. Curated relational information has proven to play an important role in applications such as drug repurposing and precision medicine. Recent advances in deep learning produced a transformer architecture named BERT (Bidirectional Encoder Representations from Transformers). This language model, pretrained on the BooksCorpus (800M words) and English Wikipedia (2,500M words), reported state-of-the-art results on various NLP (Natural Language Processing) tasks, including relation extraction. It is widely accepted that, because of the shift in word distribution, general-domain models perform poorly on biomedical information extraction tasks. The architecture was therefore adapted to the biomedical domain by pretraining the language model on roughly 28 million scientific articles from PubMed and PubMed Central. This chapter presents a protocol for relation extraction using BERT, discussing state-of-the-art biomedical BERT variants such as BioBERT. The protocol covers the general BERT architecture, pretraining and fine-tuning, leveraging biomedical information, and finally the infusion of knowledge graphs into the BERT model layers.
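To make the fine-tuning step of the protocol concrete, the following is a minimal sketch of sentence-level relation classification with a pretrained BioBERT checkpoint. It assumes the Hugging Face transformers library and PyTorch are available and uses the publicly released dmis-lab/biobert-base-cased-v1.1 model; the @GENE$ and @DISEASE$ placeholders follow the entity-anonymization convention used in BioBERT's relation-extraction experiments, and the two-label mapping shown here is purely illustrative.

# Minimal sketch: relation classification with BioBERT (assumed checkpoint name).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "dmis-lab/biobert-base-cased-v1.1"  # assumed public BioBERT checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
# Illustrative label mapping: 0 = no relation, 1 = gene-disease association.
# The classification head is randomly initialized here; in practice it must be
# fine-tuned on a labeled relation-extraction corpus before the scores are meaningful.
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)
model.eval()

# Candidate entity mentions are anonymized with placeholder tags before classification,
# following the BioBERT relation-extraction setup.
sentence = "Mutations in @GENE$ have been associated with an increased risk of @DISEASE$."

inputs = tokenizer(sentence, return_tensors="pt", truncation=True, max_length=128)
with torch.no_grad():
    logits = model(**inputs).logits
probs = torch.softmax(logits, dim=-1).squeeze()
print(f"P(no relation) = {probs[0]:.3f}, P(association) = {probs[1]:.3f}")

In practice the same model and tokenizer would first be fine-tuned on a task-specific corpus (for example a gene-disease association dataset) using a standard sequence-classification training loop, after which the forward pass above yields usable relation scores.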

Keywords


MeSH Terms

Information Storage and Retrieval
Language
Natural Language Processing
PubMed
Publications
