Halvade: scalable sequence analysis with MapReduce.

Dries Decap, Joke Reumers, Charlotte Herzeel, Pascal Costanza, Jan Fostier
Author Information
  1. Dries Decap: Department of Information Technology, Ghent University - iMinds, Gaston Crommenlaan 8 bus 201, 9050 Ghent, Belgium, ExaScience Life Lab, Kapeldreef 75, 3001 Leuven, Belgium.
  2. Joke Reumers: ExaScience Life Lab, Kapeldreef 75, 3001 Leuven, Belgium, Janssen Research & Development, a division of Janssen Pharmaceutica N.V., 2340 Beerse, Belgium.
  3. Charlotte Herzeel: ExaScience Life Lab, Kapeldreef 75, 3001 Leuven, Belgium, Imec, Kapeldreef 75, 3001 Leuven, Belgium.
  4. Pascal Costanza: ExaScience Life Lab, Kapeldreef 75, 3001 Leuven, Belgium, Intel Corporation Belgium.
  5. Jan Fostier: Department of Information Technology, Ghent University - iMinds, Gaston Crommenlaan 8 bus 201, 9050 Ghent, Belgium, ExaScience Life Lab, Kapeldreef 75, 3001 Leuven, Belgium.

Abstract

MOTIVATION: Post-sequencing DNA analysis typically consists of read mapping followed by variant calling. Especially for whole genome sequencing, this computational step is very time-consuming, even when using multithreading on a multi-core machine.
RESULTS: We present Halvade, a framework that enables sequencing pipelines to be executed in parallel on a multi-node and/or multi-core compute infrastructure in a highly efficient manner. As an example, a DNA sequencing analysis pipeline for variant calling has been implemented according to the GATK Best Practices recommendations, supporting both whole genome and whole exome sequencing. Using a 15-node computer cluster with 360 CPU cores in total, Halvade processes the NA12878 dataset (human, 100 bp paired-end reads, 50× coverage) in <3 h with very high parallel efficiency. Even on a single, multi-core machine, Halvade attains a significant speedup compared with running the individual tools with multithreading.
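
The abstract does not describe how the pipeline is split across MapReduce phases, so the following is only a minimal sketch of one plausible decomposition, assuming that map tasks align chunks of reads and that reduce tasks call variants per genomic region. All names here (`REGION_SIZE`, `map_reads`, `shuffle`, `reduce_region`, `run_pipeline`) are hypothetical and are not part of Halvade; the "aligner" and "variant caller" are toy stand-ins.

```python
from collections import defaultdict

REGION_SIZE = 1000  # hypothetical region width in bases (assumption)


def map_reads(chunk):
    """Map phase: 'align' each read and emit (region_key, aligned_read).

    Stand-in for a real aligner: each toy read already carries its
    chromosome and position as (chrom, pos, seq).
    """
    for chrom, pos, seq in chunk:
        yield (chrom, pos // REGION_SIZE), (pos, seq)


def shuffle(pairs):
    """Group the emitted values by key, as a MapReduce runtime would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_region(key, aligned):
    """Reduce phase: stand-in for variant calling on one region.

    Sorts the reads by position and reports how many fall in the region.
    """
    ordered = sorted(aligned)
    return key, len(ordered)


def run_pipeline(chunks):
    """Run map, shuffle and reduce sequentially over all read chunks."""
    pairs = [pair for chunk in chunks for pair in map_reads(chunk)]
    return dict(reduce_region(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    chunks = [
        [("chr1", 10, "ACGT"), ("chr1", 1500, "GGCA")],
        [("chr1", 20, "TTGA"), ("chr2", 5, "CCAT")],
    ]
    print(run_pipeline(chunks))
```

In this sketch each chunk and each region is independent of the others, which is the property that would let a cluster process them in parallel; the sequential loop here only illustrates the data flow.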

MeSH Term

Genome, Human
Humans
Sequence Analysis, DNA
Software
