Scalable metagenomics alignment research tool (SMART): a scalable, rapid, and complete search heuristic for the classification of metagenomic sequences from complex sequence populations.

Aaron Y Lee, Cecilia S Lee, Russell N Van Gelder
Author Information
  1. Aaron Y Lee: Department of Ophthalmology, University of Washington School of Medicine, Box 359608, 325 Ninth Avenue, Seattle, WA, 98104, USA. leeay@uw.edu.
  2. Cecilia S Lee: Department of Ophthalmology, University of Washington School of Medicine, Box 359608, 325 Ninth Avenue, Seattle, WA, 98104, USA.
  3. Russell N Van Gelder: Department of Ophthalmology, University of Washington School of Medicine, Box 359608, 325 Ninth Avenue, Seattle, WA, 98104, USA.

Abstract

BACKGROUND: Next generation sequencing technology has enabled metagenomic characterization through massively parallel genomic DNA sequencing. The complexity and diversity of environmental samples such as the human gut microflora, combined with the sustained exponential growth in sequencing capacity, have made identifying microbial organisms by DNA sequence a challenge. We sought to validate the Scalable Metagenomics Alignment Research Tool (SMART), a novel searching heuristic for shotgun metagenomics sequencing results.
RESULTS: After retrieving all genomic DNA sequences from the NCBI GenBank, over 1 × 10^11 base pairs of 3.3 × 10^6 sequences from 9.25 × 10^5 species were indexed using 4 base pair hashtable shards. A MapReduce searching strategy was used to distribute the search workload in a computing cluster environment. In addition, a one base pair permutation algorithm was used to account for single nucleotide polymorphisms and sequencing errors. Simulated datasets used to evaluate Kraken, a similar metagenomics classification tool, were used to measure and compare precision and accuracy. Finally, using the same set of training sequences, we compared Kraken, CLARK, and SMART within the same computing environment. Utilizing 12 computational nodes, we completed the classification of all datasets in under 10 min each using exact matching, with an average throughput of over 1.95 × 10^6 reads classified per minute. With permutation matching, we achieved sensitivity greater than 83% and precision greater than 94% with simulated datasets at the species classification level. We demonstrated the application of this technique to conjunctival and gut microbiome metagenomics sequencing results. In our head-to-head comparison, SMART and CLARK had similar accuracy gains over Kraken at the species classification level, but SMART required approximately half as much RAM as CLARK.
CONCLUSIONS: SMART is the first scalable, efficient, and rapid metagenomics classification algorithm capable of matching against all the species and sequences present in the NCBI GenBank and allows for a single step classification of microorganisms as well as large plant, mammalian, or invertebrate genomes from which the metagenomic sample may have been derived.
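
The sharded index and one base pair permutation matching mentioned in the RESULTS can be illustrated with a minimal Python sketch. This is not the SMART implementation. The abstract does not say how the 4 base pair shards are keyed or what unit is matched, so the sketch assumes shards are keyed by the first 4 bases of a fixed-length k-mer, and that "permutation" means trying every single-base substitution. The names build_index, one_bp_variants, and lookup, and the k-mer length K, are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

K = 8             # hypothetical k-mer length; the abstract does not state one
SHARD_PREFIX = 4  # assumption: shards are keyed by the first 4 bases


def kmers(seq, k=K):
    """Yield every overlapping k-mer of a sequence."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_index(references):
    """Map shard prefix -> k-mer -> set of species containing that k-mer.

    `references` is a dict of species name -> DNA sequence.
    """
    shards = defaultdict(lambda: defaultdict(set))
    for species, seq in references.items():
        for km in kmers(seq):
            shards[km[:SHARD_PREFIX]][km].add(species)
    return shards


def one_bp_variants(kmer):
    """Yield the k-mer itself plus every single-base substitution of it."""
    yield kmer
    for i, base in enumerate(kmer):
        for alt in "ACGT":
            if alt != base:
                yield kmer[:i] + alt + kmer[i + 1:]


def lookup(shards, kmer, permute=False):
    """Return species matching the k-mer exactly or, if permute is True,
    within one substitution. Each candidate is routed to its own shard,
    since a substitution in the first 4 bases changes the shard key."""
    candidates = one_bp_variants(kmer) if permute else (kmer,)
    hits = set()
    for cand in candidates:
        shard = shards.get(cand[:SHARD_PREFIX])
        if shard is not None:
            hits |= shard.get(cand, set())
    return hits
```

Under these assumptions, each shard could be held by a different worker, so a lookup only touches the shard named by the candidate's prefix; a distributed search of the kind the abstract describes would fan candidates out to shards and merge the returned species sets. Permutation matching multiplies the number of lookups per k-mer by 3K + 1, which is one plausible reason to offer exact matching as the faster mode.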

Grants

  1. K23 EY024921/NEI NIH HHS
  2. P30 EY001730/NEI NIH HHS
  3. R01 EY022038/NEI NIH HHS

MeSH Terms

Algorithms
Databases, Nucleic Acid
Heuristics
High-Throughput Nucleotide Sequencing
Humans
Metagenomics
Sequence Analysis, DNA
Software
