Online Bayesian Phylogenetic Inference: Theoretical Foundations via Sequential Monte Carlo.

Vu Dinh, Aaron E Darling, Frederick A Matsen IV
Author Information
  1. Vu Dinh: Department of Mathematical Sciences, University of Delaware, 312 Ewing Hall, Newark, DE 19716, USA.
  2. Aaron E Darling: The ithree institute, University of Technology Sydney, 15 Broadway, Ultimo NSW 2007, Australia.
  3. Frederick A Matsen IV: Program in Computational Biology, Fred Hutchinson Cancer Research Center, 1100 Fairview Avenue North, Seattle, WA 98109, USA.

Abstract

Phylogenetics, the inference of evolutionary trees from molecular sequence data such as DNA, is an enterprise that yields valuable evolutionary understanding of many biological systems. Bayesian phylogenetic algorithms, which approximate a posterior distribution on trees, have become a popular, if computationally expensive, means of doing phylogenetics. Modern data collection technologies are quickly adding new sequences to already substantial databases. With all current techniques for Bayesian phylogenetics, computation must start anew each time a new sequence becomes available, making it costly to maintain an up-to-date estimate of a phylogenetic posterior. These considerations highlight the need for an online Bayesian phylogenetic method that can update an existing posterior with new sequences. Here, we provide theoretical results on the consistency and stability of methods for online Bayesian phylogenetic inference based on Sequential Monte Carlo (SMC) and Markov chain Monte Carlo. We first show a consistency result, demonstrating that the method samples from the correct distribution in the limit of a large number of particles. Next, we derive the first reported set of bounds on how phylogenetic likelihood surfaces change when new sequences are added. These bounds enable us to characterize the theoretical performance of sampling algorithms by bounding from below the effective sample size (ESS) attained with a given number of particles. We show that the ESS is guaranteed to grow linearly as the number of particles in an SMC sampler grows. Surprisingly, this result holds even though the dimension of the phylogenetic model grows with each newly added sequence.
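
To make the ESS statement concrete, the following is a minimal sketch of how an effective sample size is commonly computed from particle weights in an SMC sampler. It assumes the standard Kish estimator, ESS = (sum of weights)^2 / (sum of squared weights); the abstract does not say which ESS definition the paper uses, and the function name `effective_sample_size` is hypothetical. The paper's bounds themselves are not reproduced here.

```python
import math


def effective_sample_size(log_weights):
    """Kish ESS, (sum w)^2 / sum(w^2), from unnormalized log-weights.

    Assumption: this is the common textbook estimator, not necessarily
    the exact quantity bounded in the paper.
    """
    if not log_weights:
        raise ValueError("need at least one particle")
    # Subtract the maximum log-weight before exponentiating so the largest
    # weight becomes 1.0; the ESS is invariant to this rescaling.
    m = max(log_weights)
    w = [math.exp(lw - m) for lw in log_weights]
    total = sum(w)
    return total * total / sum(x * x for x in w)


# Equal weights: every particle counts, so the ESS equals the particle count.
uniform = effective_sample_size([0.0] * 100)

# One dominant weight: the ESS collapses toward a single particle.
degenerate = effective_sample_size([0.0, -1000.0, -1000.0])

# Repeating the same weight pattern with twice as many particles doubles the ESS.
pattern = [0.0, -1.0, -2.0]
doubled = effective_sample_size(pattern * 2)
```

Under this definition, if the ratio of the smallest to the largest weight stays above some constant c > 0, the ESS is at least c^2 times the number of particles. This elementary fact only illustrates what a linear lower bound on the ESS looks like; it is not the argument or the constant from the paper.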

Grants

  1. U54 GM111274/NIGMS NIH HHS
  2. Howard Hughes Medical Institute

MeSH Terms

Algorithms
Bayes Theorem
Classification
Models, Biological
Monte Carlo Method
Phylogeny
