Improvement in Protein Domain Identification Is Reached by Breaking Consensus, with the Agreement of Many Profiles and Domain Co-occurrence.

Juliana Bernardes, Gerson Zaverucha, Catherine Vaquero, Alessandra Carbone
Author Information
  1. Juliana Bernardes: Sorbonne Universités, UPMC Univ-Paris 6, CNRS, UMR 7238, Laboratoire de Biologie Computationnelle et Quantitative, Paris, France.
  2. Gerson Zaverucha: COPPE, Programa de Engenharia de Sistemas e Computação, Universidade Federal do Rio de Janeiro, Rio de Janeiro, Brazil.
  3. Catherine Vaquero: Sorbonne Universités, UPMC Univ-Paris 6, INSERM U1135, CNRS ERL 8255, Centre d'Immunologie et des Maladies Infectieuses (CIMI-Paris), Paris, France.
  4. Alessandra Carbone: Sorbonne Universités, UPMC Univ-Paris 6, CNRS, UMR 7238, Laboratoire de Biologie Computationnelle et Quantitative, Paris, France.

Abstract

Traditional protein annotation methods describe known domains with probabilistic models that represent the consensus among homologous domain sequences. However, when relevant signals become too weak to be captured by a global consensus, annotation attempts fail. Here we address the fundamental question of domain identification for highly divergent proteins. Using high-performance computing, we demonstrate that the limits of state-of-the-art annotation methods can be bypassed. We design a new strategy based on the observation that many structural and functional protein constraints are not globally conserved across all species but might be locally conserved in separate clades. We propose a novel exploitation of the large amount of data available: (1) for each known protein domain, several probabilistic clade-centered models are constructed from a large and differentiated panel of homologous sequences; (2) a decision-making protocol combines the outcomes obtained from multiple models; (3) a multi-criteria optimization algorithm finds the most likely protein architecture. The method is evaluated for domain and architecture prediction over several datasets and statistical testing hypotheses. Its performance is compared against HMMScan and HHblits, two widely used search methods based on sequence-profile and profile-profile comparison, respectively. Due to their closeness to actual protein sequences, clade-centered models are shown to be more specific and functionally predictive than the broadly used consensus models. Based on them, we improved the annotation of Plasmodium falciparum protein sequences on a scale not previously possible. We successfully predict at least one domain for 72% of P. falciparum proteins, against 63% achieved previously, corresponding to a 30% improvement in the total number of Pfam domain predictions over the whole genome.
The method is applicable to any genome and opens new avenues to tackle evolutionary questions such as the reconstruction of ancient domain duplications, the reconstruction of the history of protein architectures, and the estimation of protein domain age. Website and software: http://www.lcqb.upmc.fr/CLADE.
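
The last step of the strategy outlined in the abstract, choosing an architecture from hits produced by many models, can be illustrated with a deliberately simplified sketch. It is not the CLADE algorithm: the published method uses a multi-criteria optimization that also accounts for domain co-occurrence, whereas this example assumes a single score per hit and simply selects the non-overlapping set of hits with the highest total score (weighted interval scheduling). The `Hit` record, its field names, the domain and model labels, and the scores are all hypothetical.

```python
from bisect import bisect_right
from collections import namedtuple

# Hypothetical record for one hit of one clade-centered model on a protein.
# start/end are inclusive residue positions; a higher score is better.
Hit = namedtuple("Hit", "domain model start end score")

def best_architecture(hits):
    """Return the non-overlapping hits with maximal total score.

    Simplifying assumption: one additive score per hit, no co-occurrence term.
    """
    hits = sorted(hits, key=lambda h: h.end)
    ends = [h.end for h in hits]
    # best[i] = (total score, chosen hits) using only the first i sorted hits
    best = [(0.0, [])]
    for i, h in enumerate(hits):
        # j = number of earlier hits that end strictly before this hit starts
        j = bisect_right(ends, h.start - 1, 0, i)
        take = (best[j][0] + h.score, best[j][1] + [h])
        skip = best[i]
        best.append(take if take[0] > skip[0] else skip)
    return best[-1][1]

# Two models of the same domain compete for one region; two other
# domains compete for the downstream region.
hits = [
    Hit("DomA", "clade1", 1, 100, 50.0),
    Hit("DomA", "clade2", 5, 95, 62.0),
    Hit("DomB", "clade1", 90, 180, 40.0),
    Hit("DomC", "clade3", 110, 200, 45.0),
]
architecture = best_architecture(hits)
print([(h.domain, h.model) for h in architecture])
```

In this toy input the second DomA model outscores the first on the same region, and DomC is kept over DomB because DomB overlaps the chosen DomA hit. A fuller treatment would replace the single score with several criteria, as the abstract describes.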

MeSH Terms

Amino Acid Sequence
Computational Biology
Consensus Sequence
Databases, Protein
Plasmodium falciparum
Protein Domains
Proteins
Protozoan Proteins
Sequence Alignment
Sequence Analysis, Protein

Chemicals

Proteins
Protozoan Proteins
