Maximum-Likelihood Phylogenetic Inference with Selection on Protein Folding Stability.

Miguel Arenas, Agustin Sánchez-Cobos, Ugo Bastolla
Author Information
  1. Miguel Arenas: Department of Cell Biology and Immunology, Centro de Biología Molecular Severo Ochoa (CSIC-UAM), Universidad Autónoma de Madrid, Madrid, Spain.
  2. Agustin Sánchez-Cobos: Department of Cell Biology and Immunology, Centro de Biología Molecular Severo Ochoa (CSIC-UAM), Universidad Autónoma de Madrid, Madrid, Spain.
  3. Ugo Bastolla: Department of Cell Biology and Immunology, Centro de Biología Molecular Severo Ochoa (CSIC-UAM), Universidad Autónoma de Madrid, Madrid, Spain. Email: ubastolla@cbm.csic.es

Abstract

Despite intense work, incorporating constraints on protein native structures into mathematical models of molecular evolution remains difficult, because most models and programs assume that protein sites evolve independently, whereas protein stability is maintained by interactions between sites. Here, we address this problem by developing a new mean-field substitution model that generates independent site-specific amino acid distributions with constraints on the stability of the native state against both unfolding and misfolding. The model depends on a background distribution of amino acids and one selection parameter, which we fix by maximizing the likelihood of the observed protein sequence. The analytic solution of the model shows that the main determinant of the site-specific distributions is the number of native contacts of the site, and that the most variable sites are those with an intermediate number of native contacts. Mean-field models that take misfolded conformations into account yield a larger likelihood than models that consider only the native state, because their average hydrophobicity is more realistic, and they produce sequences that are stable on average for most proteins. We evaluated the mean-field model against empirical substitution models on 12 test data sets of different protein families. In all cases, the observed site-specific sequence profiles showed a smaller Kullback-Leibler divergence from the mean-field distributions than from the empirical substitution model. Next, we obtained substitution rates by combining the mean-field frequencies with an empirical substitution model. The resulting mean-field substitution model assigns a larger likelihood than the empirical model to all studied families when we consider sequences with identity larger than 0.35, a condition that plausibly enforces conservation of the native structure across the family.
We found that the mean-field model performs better than other structurally constrained models of similar or higher complexity. Compared with the much more complex model recently developed by Bordner and Mittelmann, which takes into account pairwise terms in the amino acid distributions and also optimizes the exchangeability matrix, our model performed worse on data with small sequence divergence but better on data with larger sequence divergence. The mean-field model has been implemented in the computer program Prot_Evol, which is freely available at http://ub.cbm.uam.es/software/Prot_Evol.php.
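
The two quantitative steps named in the abstract, comparing an observed site profile with a model distribution by Kullback-Leibler divergence and turning site-specific frequencies plus an empirical exchangeability matrix into substitution rates, can be illustrated with a minimal Python sketch. This is not the Prot_Evol implementation. The function names, the pseudocount, and the GTR-style form Q_ab = S_ab * pi_b are assumptions chosen for illustration; the abstract does not state how the frequencies and the empirical model are combined.

```python
import numpy as np


def kl_divergence(p_obs, q_model, pseudocount=1e-9):
    """Kullback-Leibler divergence KL(p_obs || q_model) in nats.

    p_obs: observed amino acid frequencies at one site (any positive scale).
    q_model: model frequencies at the same site.
    pseudocount: hypothetical regularizer so that q is never exactly zero.
    """
    p = np.asarray(p_obs, dtype=float)
    q = np.asarray(q_model, dtype=float) + pseudocount
    p = p / p.sum()
    q = q / q.sum()
    mask = p > 0  # terms with p = 0 contribute nothing to the sum
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))


def site_rate_matrix(exchangeability, site_freqs):
    """Assumed GTR-style rate matrix for one site: Q_ab = S_ab * pi_b.

    exchangeability: symmetric matrix S with positive off-diagonal entries
        (in practice taken from an empirical model).
    site_freqs: stationary frequencies pi for this site.
    The result is scaled to one expected substitution per unit time.
    """
    S = np.asarray(exchangeability, dtype=float)
    pi = np.asarray(site_freqs, dtype=float)
    pi = pi / pi.sum()
    Q = S * pi[np.newaxis, :]            # off-diagonal rates a -> b
    np.fill_diagonal(Q, 0.0)
    np.fill_diagonal(Q, -Q.sum(axis=1))  # each row sums to zero
    mean_rate = -np.dot(pi, np.diag(Q))  # expected substitutions per unit time
    return Q / mean_rate


if __name__ == "__main__":
    # Toy 3-letter alphabet; real use would have 20 amino acids.
    S = np.array([[0.0, 1.0, 2.0],
                  [1.0, 0.0, 3.0],
                  [2.0, 3.0, 0.0]])
    pi_site = np.array([0.5, 0.3, 0.2])
    observed = np.array([0.6, 0.3, 0.1])
    print("KL(observed || model):", kl_divergence(observed, pi_site))
    print("Rate matrix:\n", site_rate_matrix(S, pi_site))
```

With a symmetric exchangeability matrix this construction is time-reversible and has the site frequencies as its stationary distribution, which is why it is a common way to attach site-specific frequencies to an empirical model.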

MeSH Terms

Models, Chemical
Models, Genetic
Phylogeny
Protein Folding
Protein Stability
Proteins

Chemicals

Proteins
