iDNA-OpenPrompt: OpenPrompt learning model for identifying DNA methylation.

Xia Yu, Jia Ren, Haixia Long, Rao Zeng, Guoqiang Zhang, Anas Bilal, Yani Cui
Author Information
  1. Xia Yu: School of Information and Communication Engineering, Hainan University, Haikou, Hainan, China.
  2. Jia Ren: School of Information and Communication Engineering, Hainan University, Haikou, Hainan, China.
  3. Haixia Long: School of Information Science and Technology, Hainan Normal University, Haikou, Hainan, China.
  4. Rao Zeng: School of Information Science and Technology, Hainan Normal University, Haikou, Hainan, China.
  5. Guoqiang Zhang: School of Information Science and Technology, Hainan Normal University, Haikou, Hainan, China.
  6. Anas Bilal: School of Information Science and Technology, Hainan Normal University, Haikou, Hainan, China.
  7. Yani Cui: School of Information and Communication Engineering, Hainan University, Haikou, Hainan, China.

Abstract

DNA methylation is a critical epigenetic modification involving the addition of a methyl group to the DNA molecule, playing a key role in regulating gene expression without changing the DNA sequence. The main difficulty in identifying DNA methylation sites lies in the subtle and complex nature of methylation patterns, which may vary across tissues, developmental stages, and environmental conditions. Traditional methods for methylation site identification, such as bisulfite sequencing, are typically labor-intensive and costly and require large amounts of DNA, which hinders high-throughput analysis. Moreover, these methods may not always provide the resolution needed to detect methylation at specific sites, especially in genomic regions that are rich in repetitive sequences or have low levels of methylation. Furthermore, current deep learning approaches generally lack sufficient accuracy. This study introduces the iDNA-OpenPrompt model, which leverages the novel OpenPrompt learning framework. The model combines a prompt template, a prompt verbalizer, and a Pre-trained Language Model (PLM) to construct a prompt-learning framework for DNA methylation sequences. A DNA vocabulary library, a BERT tokenizer, and specific label words are also introduced into the model to enable accurate identification of DNA methylation sites. An extensive analysis is conducted to evaluate the predictive capability, reliability, and consistency of the iDNA-OpenPrompt model. The experimental outcomes, covering 17 benchmark datasets that include various species and three DNA methylation modifications (4mC, 5hmC, 6mA), consistently indicate that our model surpasses existing approaches in both performance and robustness.
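
The components named in the abstract (a DNA vocabulary, a prompt template with a masked position, and a verbalizer that maps label words to classes) can be illustrated with a minimal pure-Python sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the overlapping 3-mer vocabulary, the template layout, the label words "yes" and "no", and the function names. The sketch does not call the OpenPrompt library or a real PLM; the scores at the masked position are supplied by hand.

```python
from itertools import product


def kmer_tokens(seq, k=3):
    """Split a DNA sequence into overlapping k-mers (assumed tokenization)."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_vocab(k=3):
    """Build a toy DNA vocabulary: BERT-style special tokens plus all 4**k k-mers."""
    specials = ["[PAD]", "[CLS]", "[SEP]", "[MASK]"]
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    return {tok: idx for idx, tok in enumerate(specials + kmers)}


def apply_template(tokens):
    """Hypothetical prompt template: the sequence followed by one masked slot."""
    return ["[CLS]"] + tokens + ["[SEP]", "[MASK]", "[SEP]"]


# Hypothetical verbalizer: each class is represented by one or more label words.
VERBALIZER = {"methylated": ["yes"], "unmethylated": ["no"]}


def verbalize(mask_scores, verbalizer=VERBALIZER):
    """Average the scores of each class's label words and return the best class."""
    class_scores = {
        label: sum(mask_scores.get(word, 0.0) for word in words) / len(words)
        for label, words in verbalizer.items()
    }
    return max(class_scores, key=class_scores.get)


if __name__ == "__main__":
    tokens = kmer_tokens("ACGTAC", k=3)
    prompt = apply_template(tokens)
    vocab = build_vocab(k=3)
    print(prompt)
    print([vocab[t] for t in prompt])
    # In a real system these scores would come from the PLM at the [MASK] position.
    print(verbalize({"yes": 0.7, "no": 0.3}))
```

In a full pipeline, the hand-written scores would be replaced by the language model's output distribution at the masked position, and the template and label words would be those chosen by the authors.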
