Pre-trained models, data augmentation, and ensemble learning for biomedical information extraction and document classification.

Arslan Erdengasileng, Qing Han, Tingting Zhao, Shubo Tian, Xin Sui, Keqiao Li, Wanjing Wang, Jian Wang, Ting Hu, Feng Pan, Yuan Zhang, Jinfeng Zhang
Author Information
  1. Arslan Erdengasileng: Department of Statistics, Florida State University, Tallahassee, FL 32306, USA.
  2. Qing Han: Department of Statistics, Florida State University, Tallahassee, FL 32306, USA.
  3. Tingting Zhao: Department of Geography, Florida State University, Tallahassee, FL 32306, USA.
  4. Shubo Tian: Department of Statistics, Florida State University, Tallahassee, FL 32306, USA.
  5. Xin Sui: Department of Statistics, Florida State University, Tallahassee, FL 32306, USA.
  6. Keqiao Li: Department of Statistics, Florida State University, Tallahassee, FL 32306, USA.
  7. Wanjing Wang: Department of Statistics, Florida State University, Tallahassee, FL 32306, USA.
  8. Jian Wang: Cloudmedx Inc, Palo Alto, CA 94301, USA.
  9. Ting Hu: Department of Statistics, Florida State University, Tallahassee, FL 32306, USA.
  10. Feng Pan: Department of Statistics, Florida State University, Tallahassee, FL 32306, USA.
  11. Yuan Zhang: Department of Statistics, Florida State University, Tallahassee, FL 32306, USA.
  12. Jinfeng Zhang: Department of Statistics, Florida State University, Tallahassee, FL 32306, USA.

Abstract

Publications in the biomedical sciences are being produced in large volumes and at an ever-increasing pace. To handle this large amount of unstructured text, effective natural language processing (NLP) methods need to be developed for tasks such as document classification and information extraction. The BioCreative Challenge was established to evaluate the effectiveness of information extraction methods in the biomedical domain and to facilitate their development as a community-wide effort. In this paper, we summarize our work and what we learned from the latest round, BioCreative Challenge VII, in which we participated in all five tracks. Overall, we found three key components for achieving high performance across a variety of NLP tasks: (1) pre-trained NLP models; (2) data augmentation strategies; and (3) ensemble modelling. These three strategies need to be tailored to the specific task at hand to achieve high-performing baseline models, which are usually good enough for practical applications. When further combined with task-specific methods, additional improvements (usually rather small) can be achieved, which might be critical for winning competitions. Database URL: https://doi.org/10.1093/database/baac066.
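The abstract names ensemble modelling as one of the three components but does not say how predictions were combined. As a rough illustration only, and not the authors' method, the sketch below shows two common ways to combine per-document predictions from several classifiers: hard (majority) voting and soft voting over averaged class probabilities. The function names, the labels "relevant" and "irrelevant", and the probability values are all hypothetical.

```python
from collections import Counter
from typing import Dict, Sequence


def majority_vote(predictions: Sequence[str]) -> str:
    """Hard voting: return the label predicted by the most models.

    Ties are resolved in favour of the label that appears first.
    """
    return Counter(predictions).most_common(1)[0][0]


def soft_vote(probabilities: Sequence[Dict[str, float]]) -> str:
    """Soft voting: average each class probability across models,
    then return the class with the highest average."""
    n_models = len(probabilities)
    averaged: Dict[str, float] = {}
    for model_probs in probabilities:
        for label, p in model_probs.items():
            averaged[label] = averaged.get(label, 0.0) + p / n_models
    return max(averaged, key=lambda label: averaged[label])


# Hypothetical outputs of three classifiers for a single document.
model_outputs = [
    {"relevant": 0.60, "irrelevant": 0.40},
    {"relevant": 0.30, "irrelevant": 0.70},
    {"relevant": 0.55, "irrelevant": 0.45},
]

# Each model's hard prediction is its highest-probability label.
hard_labels = [max(probs, key=lambda k: probs[k]) for probs in model_outputs]

hard_result = majority_vote(hard_labels)
soft_result = soft_vote(model_outputs)
```

In this made-up case the two schemes disagree: two of the three models lean towards "relevant", but the one dissenting model is confident enough that the averaged probabilities favour "irrelevant". Which scheme works better is task-dependent, which is consistent with the abstract's point that these strategies need tailoring to the task at hand.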

Grants

  1. R01 GM126558/NIGMS NIH HHS

MeSH Terms

Data Mining
Databases, Factual
Machine Learning
Natural Language Processing
