An investigation of irreproducibility in maximum likelihood phylogenetic inference.

Xing-Xing Shen, Yuanning Li, Chris Todd Hittinger, Xue-Xin Chen, Antonis Rokas
Author Information
  1. Xing-Xing Shen: State Key Laboratory of Rice Biology, Ministry of Agriculture Key Lab of Molecular Biology of Crop Pathogens and Insects, Zhejiang University, 310058, Hangzhou, China. xingxingshen@zju.edu.cn.
  2. Yuanning Li: Department of Biological Sciences, Vanderbilt University, Nashville, TN, 37235, USA.
  3. Chris Todd Hittinger: Laboratory of Genetics, J. F. Crow Institute for the Study of Evolution, Wisconsin Energy Institute, Center for Genomic Science Innovation, University of Wisconsin-Madison, Madison, WI, 53706, USA.
  4. Xue-Xin Chen: State Key Laboratory of Rice Biology, Ministry of Agriculture Key Lab of Molecular Biology of Crop Pathogens and Insects, Zhejiang University, 310058, Hangzhou, China.
  5. Antonis Rokas: Department of Biological Sciences, Vanderbilt University, Nashville, TN, 37235, USA. antonis.rokas@vanderbilt.edu.

Abstract

Phylogenetic trees are essential for studying biology, but their reproducibility under identical parameter settings remains unexplored. Here, we find that 3,515 (18.11%) IQ-TREE-inferred and 1,813 (9.34%) RAxML-NG-inferred maximum likelihood (ML) gene trees are topologically irreproducible when executing two replicates (Run1 and Run2) for each of 19,414 gene alignments in 15 animal, plant, and fungal phylogenomic datasets. Notably, coalescent-based ASTRAL species phylogenies inferred from the Run1 and Run2 sets of individual gene trees are topologically irreproducible for 9/15 phylogenomic datasets, whereas concatenation-based phylogenies inferred twice from the same supermatrix are reproducible. Our simulations further show that irreproducible phylogenies are more likely to be incorrect than reproducible phylogenies. These results suggest that a considerable fraction of single-gene ML trees may be irreproducible. Reproducibility in ML inference can be increased by providing analyses' log files, which contain not only typically reported parameters (e.g., program, substitution model, number of tree searches) but also typically unreported ones (e.g., random starting seed number, number of threads, processor type).
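
To make "topologically irreproducible" concrete, the sketch below shows one common way to test whether two replicate trees share a topology: compare their sets of non-trivial bipartitions (splits), which ignores branch lengths, rooting, and the order of children. This is an illustrative sketch, not the authors' pipeline. The function names (leaf_sets, splits, same_topology, rf_distance) and the example trees are hypothetical, and the parser assumes simple Newick strings without quoted labels or comments.

```python
import re


def leaf_sets(newick):
    """Return (all_leaves, clusters) for a simple Newick string.

    Assumption: labels are unquoted and contain no parentheses, commas,
    or semicolons. Branch lengths and internal labels are ignored.
    """
    tokens = [t.strip() for t in re.findall(r"[(),;]|[^(),;]+", newick)]
    clusters = []          # leaf set of every clade closed by ")"
    stack = []             # one list of child leaf sets per open "("
    leaves = frozenset()
    prev = None
    for tok in tokens:
        if not tok:
            continue       # skip whitespace-only tokens
        if tok == "(":
            stack.append([])
        elif tok == ")":
            clade = frozenset().union(*stack.pop())
            clusters.append(clade)
            if stack:
                stack[-1].append(clade)
            else:
                leaves = clade   # outermost clade holds all leaves
        elif tok not in ",;" and prev != ")":
            # A label right after "(" or "," is a leaf; strip ":length".
            name = tok.split(":")[0].strip()
            if name and stack:
                stack[-1].append(frozenset([name]))
        # Labels that follow ")" are support values or lengths: ignored.
        prev = tok
    return leaves, clusters


def splits(newick):
    """Return the set of non-trivial bipartitions of the unrooted tree."""
    leaves, clusters = leaf_sets(newick)
    result = set()
    for clade in clusters:
        other = leaves - clade
        if len(clade) > 1 and len(other) > 1:
            result.add(frozenset((clade, other)))
    return result


def same_topology(tree_a, tree_b):
    """True if both trees induce exactly the same bipartitions."""
    return splits(tree_a) == splits(tree_b)


def rf_distance(tree_a, tree_b):
    """Robinson-Foulds distance: number of splits found in only one tree."""
    return len(splits(tree_a) ^ splits(tree_b))


# Hypothetical replicate trees for one gene alignment.
run1 = "((A:0.1,B:0.2):0.05,(C:0.1,D:0.3):0.05,E:0.4);"
run2 = "(E:0.4,(D:0.3,C:0.1):0.05,(B:0.2,A:0.1):0.05);"  # same topology
run3 = "((A,C),(B,D),E);"                                # different topology

print(same_topology(run1, run2), rf_distance(run1, run3))
```

Under this definition, a gene alignment would count as irreproducible when the Run1 and Run2 trees have a non-zero distance. For real analyses, an established phylogenetics library is a safer choice than a hand-written parser.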

Associated Data

figshare | 10.6084/m9.figshare.11917770

MeSH Terms

Animals
Fungi
Genes
Likelihood Functions
Mammals
Models, Genetic
Models, Statistical
Phylogeny
Plants
Reproducibility of Results
