A penalized-likelihood method to estimate the distribution of selection coefficients from phylogenetic data.

Asif U Tamuri, Nick Goldman, Mario dos Reis
Author Information
  1. Asif U Tamuri: European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom.

Abstract

We develop a maximum penalized-likelihood (MPL) method to estimate the fitnesses of amino acids and the distribution of selection coefficients (S = 2Ns) in protein-coding genes from phylogenetic data. This improves on a previous maximum-likelihood method. Various penalty functions are used to penalize extreme estimates of the fitnesses, thus correcting overfitting by the previous method. Using a combination of computer simulation and real data analysis, we evaluate the effect of the various penalties on the estimation of the fitnesses and the distribution of S. We show that the new method regularizes the estimates of the fitnesses for small, relatively uninformative data sets, but that it can still recover a large proportion of deleterious mutations when these are present in simulated data. Computer simulations indicate that the distribution of S can be estimated more accurately as the number of taxa in the phylogeny or the level of sequence divergence increases. Furthermore, the strength of the penalty can be varied to study how informative a particular data set is about the distribution of S. We analyze three sets of protein-coding genes (the chloroplast rubisco protein, mammal mitochondrial proteins, and an influenza virus polymerase) and show that the new method recovers a large proportion of deleterious mutations in these data, even under strong penalties, confirming that the distribution of S is bimodal in these real data. We recommend the new MPL approach for estimating the distribution of S in species phylogenies of protein-coding genes.
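
The idea of penalizing extreme fitness estimates can be illustrated with a toy example. The sketch below is not the authors' phylogenetic method: it assumes a single site with independent amino acid counts (no tree), equilibrium frequencies proportional to exp(F_i), and a quadratic (Gaussian-shaped) penalty chosen purely for illustration. The function names, the step size, and the relation S_ij = F_j - F_i used to turn fitnesses into selection coefficients are assumptions of this sketch rather than details given in the abstract.

```python
import math

def softmax(f):
    """Map fitnesses F_i to frequencies proportional to exp(F_i)."""
    m = max(f)
    e = [math.exp(x - m) for x in f]
    z = sum(e)
    return [x / z for x in e]

def penalized_loglik(f, counts, lam):
    """Multinomial log-likelihood minus a quadratic penalty on the fitnesses."""
    pi = softmax(f)
    loglik = sum(n * math.log(p) for n, p in zip(counts, pi) if n > 0)
    penalty = 0.5 * lam * sum(x * x for x in f)
    return loglik - penalty

def fit_fitnesses(counts, lam, steps=5000, lr=0.01):
    """Maximize the penalized log-likelihood by plain gradient ascent."""
    f = [0.0] * len(counts)
    total = sum(counts)
    for _ in range(steps):
        pi = softmax(f)
        # d/dF_i of the log-likelihood is n_i - N * pi_i;
        # d/dF_i of the penalty is lam * F_i.
        grad = [n - total * p - lam * x for n, p, x in zip(counts, pi, f)]
        f = [x + lr * g for x, g in zip(f, grad)]
    return f

def selection_coefficients(f):
    """Pairwise S_ij = F_j - F_i for i != j (assumed relation, see lead-in)."""
    return [fj - fi for i, fi in enumerate(f) for j, fj in enumerate(f) if i != j]

# Hypothetical counts for four states; the last state is never observed,
# so an unpenalized fit would push its fitness toward minus infinity.
counts = [12, 6, 2, 0]
weak = fit_fitnesses(counts, lam=0.01)   # close to unpenalized
strong = fit_fitnesses(counts, lam=1.0)  # strongly regularized
```

Comparing `weak` and `strong` shows the qualitative effect described in the abstract: a stronger penalty pulls the fitness of the unobserved state back toward the others, which shrinks the most extreme values returned by `selection_coefficients`. Varying `lam` gives a rough sense of how much of the estimate is driven by the data rather than by the penalty.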

MeSH Terms

Animals
Base Sequence
Computer Simulation
Evolution, Molecular
Genetic Fitness
Humans
Likelihood Functions
Mutation
Phylogeny
Selection, Genetic
