DNA promoter task-oriented dictionary mining and prediction model based on natural language technology.

Ruolei Zeng, Zihan Li, Jialu Li, Qingchuan Zhang
Author Information
  1. Ruolei Zeng: Department of Computer Science and Engineering, University of Minnesota, Minneapolis, MN, 55455, USA.
  2. Zihan Li: National Engineering Research Centre for Agri-Product Quality Traceability, Beijing Technology and Business University, No.11 Fucheng Road, Beijing, 100048, China. 18811325239@163.com.
  3. Jialu Li: National Engineering Research Centre for Agri-Product Quality Traceability, Beijing Technology and Business University, No.11 Fucheng Road, Beijing, 100048, China.
  4. Qingchuan Zhang: National Engineering Research Centre for Agri-Product Quality Traceability, Beijing Technology and Business University, No.11 Fucheng Road, Beijing, 100048, China. zhangqingchuan@btbu.edu.cn.

Abstract

Promoters are DNA sequences that initiate transcription and regulate gene expression. Identifying promoter sites precisely is crucial for deciphering gene expression patterns and the roles of gene regulatory networks. Recent work in bioinformatics has used deep learning and natural language processing (NLP) to improve promoter prediction accuracy, with convolutional neural networks (CNNs), long short-term memory (LSTM) networks, and BERT models being particularly impactful. However, current approaches often segment DNA sequences arbitrarily during BERT pre-training, which may not yield optimal results. To overcome this limitation, this article introduces a novel DNA sequence segmentation method. The method builds a more refined dictionary for DNA sequences, uses it for BERT pre-training, and employs an Inception neural network as the foundational model. The resulting BERT-Inception architecture captures information at multiple granularities. Experimental results show that the model improves performance on several downstream tasks and adds interpretability to the deep learning model, providing new perspectives for interpreting and understanding DNA sequence information. The detailed source code is available at https://github.com/katouMegumiH/Promoter_BERT.
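
To make the contrast between arbitrary and dictionary-based segmentation concrete, the following minimal Python sketch tokenizes the same sequence both ways. It is an illustration only, not the authors' implementation. The abstract does not say how the dictionary is mined or applied, so the greedy longest-match rule, the function names `kmer_tokens` and `dictionary_tokens`, the `max_len` parameter, and the toy vocabulary are all assumptions made for this example.

```python
from typing import List, Set


def kmer_tokens(seq: str, k: int = 3) -> List[str]:
    """Fixed-length overlapping k-mers, used here as the 'arbitrary' baseline."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def dictionary_tokens(seq: str, vocab: Set[str], max_len: int = 6) -> List[str]:
    """Greedy longest-match segmentation against a dictionary (assumed rule).

    At each position, take the longest dictionary entry that matches;
    fall back to a single nucleotide when nothing in the dictionary matches.
    """
    tokens: List[str] = []
    i = 0
    while i < len(seq):
        for length in range(min(max_len, len(seq) - i), 0, -1):
            piece = seq[i:i + length]
            if length == 1 or piece in vocab:
                tokens.append(piece)
                i += length
                break
    return tokens


# Hypothetical toy dictionary; a real one would be mined from data.
toy_vocab = {"TTGACA", "TATAAT", "GC"}
toy_seq = "TTGACAGCTATAATG"

print(kmer_tokens(toy_seq)[:4])
print(dictionary_tokens(toy_seq, toy_vocab))
```

In this sketch the k-mer baseline cuts the sequence at fixed offsets regardless of content, while the dictionary-based variant keeps each dictionary entry intact as a single token. Either token stream could then be mapped to vocabulary indices for BERT-style pre-training.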

Grants

  1. 2019YFC1606401/The National Key Technology R&D Program of China
  2. BPHR20220104/Project of Beijing Municipal University Teacher Team Construction Support Plan
  3. 099/Project of Beijing Scholars Program

MeSH Terms

Promoter Regions, Genetic
Natural Language Processing
Neural Networks, Computer
Computational Biology
Deep Learning
Data Mining
Humans
DNA
Sequence Analysis, DNA

Chemicals

DNA
