Latent Dirichlet allocation mixture models for nucleotide sequence analysis.

Bixuan Wang, Stephen M Mount
Author Information
  1. Bixuan Wang: Dept. of Cell Biology and Molecular Genetics, University of Maryland, College Park, MD 20742, USA.
  2. Stephen M Mount: Dept. of Cell Biology and Molecular Genetics, University of Maryland, College Park, MD 20742, USA.

Abstract

Strings of nucleotides carrying biological information are typically described as sequence motifs represented by weight matrices or consensus sequences. However, many signals in DNA or RNA are recognized by multiple factors in temporal sequence, consist of distinct alternative motifs, or are best described by base composition. Here we apply the latent Dirichlet allocation (LDA) mixture model to nucleotide sequences. Using positions in an alignment of human or Drosophila splice sites as samples, we show that LDA readily identifies motifs, including such elusive cases as the intron branch site. Using whole sequences with positional k-mers as features, LDA can identify sequence subtypes enriched in long vs. short introns. LDA with bulk k-mers can reliably distinguish reading frame and species of origin in coding sequences from humans and Drosophila. We find that LDA is a useful model for describing heterogeneous signals, for assigning individual sequences to subtypes, and for identifying and characterizing sequences that do not fit recognized subtypes. Because LDA topic models are interpretable, they also aid the discovery of new motifs, even those present in a small fraction of samples. In summary, LDA can identify and characterize signals in nucleotide sequences, including candidate regulatory factors involved in biological processes.
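
The abstract distinguishes two feature types: bulk k-mers (counted regardless of where they occur) and positional k-mers (counted together with their location in the sequence). The difference can be illustrated with a minimal Python sketch. This is not the authors' code; the function names, the `@` position tag, and the toy sequences are all assumptions made for illustration.

```python
from collections import Counter


def bulk_kmers(seq, k):
    """Count every k-mer in seq, ignoring where it occurs."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def positional_kmers(seq, k):
    """Count k-mers tagged with their start position, e.g. 'GT@0'.

    The 'KMER@position' naming is a hypothetical convention, not one
    taken from the paper.
    """
    seq = seq.upper()
    return Counter(f"{seq[i:i + k]}@{i}" for i in range(len(seq) - k + 1))


def count_matrix(seqs, k, featurizer):
    """Return (vocabulary, rows) with one row of feature counts per sequence."""
    counts = [featurizer(s, k) for s in seqs]
    vocab = sorted(set().union(*counts))
    rows = [[c[f] for f in vocab] for c in counts]
    return vocab, rows


# Toy sequences invented for illustration; they are not data from the study.
toy = ["GTAAGT", "GTGAGT", "GTAAGA"]
vocab, rows = count_matrix(toy, 2, positional_kmers)
```

A sequence-by-feature count matrix of this kind is the usual input to a topic-model implementation, for example the `LatentDirichletAllocation` class in scikit-learn. Whether the authors used that library is not stated here.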
