A multi-source domain annotation pipeline for quantitative metagenomic and metatranscriptomic functional profiling.

Ari Ugarte, Riccardo Vicedomini, Juliana Bernardes, Alessandra Carbone
Author Information
  1. Ari Ugarte: Sorbonne Université, UPMC-Univ P6, CNRS, IBPS, Laboratoire de Biologie Computationnelle et Quantitative - UMR 7238, 4 Place Jussieu, Paris, 75005, France.
  2. Riccardo Vicedomini: Sorbonne Université, UPMC-Univ P6, CNRS, IBPS, Laboratoire de Biologie Computationnelle et Quantitative - UMR 7238, 4 Place Jussieu, Paris, 75005, France.
  3. Juliana Bernardes: Sorbonne Université, UPMC-Univ P6, CNRS, IBPS, Laboratoire de Biologie Computationnelle et Quantitative - UMR 7238, 4 Place Jussieu, Paris, 75005, France.
  4. Alessandra Carbone: Sorbonne Université, UPMC-Univ P6, CNRS, IBPS, Laboratoire de Biologie Computationnelle et Quantitative - UMR 7238, 4 Place Jussieu, Paris, 75005, France. Alessandra.Carbone@lip6.fr.

Abstract

BACKGROUND: Until recently, biochemical and regulatory pathways have been conceived and modelled within one cell type, one organism and one species. This view is being dramatically changed by the advent of whole-microbiome sequencing studies, which reveal the role of symbiotic microbial populations in fundamental biochemical functions. This new landscape requires the reconstruction of biochemical and regulatory pathways at the community level in a given environment. To understand how environmental factors affect the genetic material and the dynamics of expression from one environment to another, we want to quantify the gene protein sequences or transcripts associated with a given pathway by precisely estimating the abundance of protein domains, including their weak presence or absence, in environmental samples.
RESULTS: MetaCLADE is a novel profile-based domain annotation pipeline built on a multi-source domain annotation strategy. It is applied directly to reads and improves the identification of the catalog of functions in microbiomes. MetaCLADE is applied to simulated data and to more than ten metagenomic and metatranscriptomic datasets from different environments, where it outperforms InterProScan in the number of annotated domains. It is also compared to the state-of-the-art non-profile-based and profile-based methods, UProC and HMM-GRASPx, and its predictions are shown to be complementary to those of UProC. A combination of MetaCLADE and UProC further improves the functional annotation of environmental samples.
CONCLUSIONS: Learning about the functional activity of environmental microbial communities is a crucial step towards understanding microbial interactions and large-scale environmental impact. MetaCLADE has been explicitly designed for metagenomic and metatranscriptomic data and, thanks to its multi-source strategy, allows for the discovery of patterns in divergent sequences. MetaCLADE substantially improves on current domain annotation methods and reaches a fine degree of accuracy in the annotation of very different environments such as soil and marine ecosystems, ancient metagenomes and human tissues.

MeSH Terms

Algorithms
Bacteria
Bacterial Proteins
Databases, Genetic
Environmental Microbiology
Gastrointestinal Microbiome
Humans
Metagenome
Metagenomics
Molecular Sequence Annotation
Protein Domains

Chemicals

Bacterial Proteins