Halvade: scalable sequence analysis with MapReduce.

Dries Decap, Joke Reumers, Charlotte Herzeel, Pascal Costanza, Jan Fostier

Author Information

Dries Decap: Department of Information Technology, Ghent University - iMinds, Gaston Crommenlaan 8 bus 201, 9050 Ghent, Belgium, ExaScience Life Lab, Kapeldreef 75, 3001 Leuven, Belgium.
Joke Reumers: ExaScience Life Lab, Kapeldreef 75, 3001 Leuven, Belgium, Janssen Research & Development, a division of Janssen Pharmaceutica N.V., 2340 Beerse, Belgium.
Charlotte Herzeel: ExaScience Life Lab, Kapeldreef 75, 3001 Leuven, Belgium, Imec, Kapeldreef 75, 3001 Leuven, Belgium.
Pascal Costanza: ExaScience Life Lab, Kapeldreef 75, 3001 Leuven, Belgium, Intel Corporation Belgium.
Jan Fostier: Department of Information Technology, Ghent University - iMinds, Gaston Crommenlaan 8 bus 201, 9050 Ghent, Belgium, ExaScience Life Lab, Kapeldreef 75, 3001 Leuven, Belgium.

PMID: 25819078 DOI: 10.1093/bioinformatics/btv179

MOTIVATION: Post-sequencing DNA analysis typically consists of read mapping followed by variant calling. Especially for whole genome sequencing, this computational step is very time-consuming, even when using multithreading on a multi-core machine.
RESULTS: We present Halvade, a framework that enables sequencing pipelines to be executed in parallel on a multi-node and/or multi-core compute infrastructure in a highly efficient manner. As an example, a DNA sequencing analysis pipeline for variant calling has been implemented according to the GATK Best Practices recommendations, supporting both whole genome and whole exome sequencing. Using a 15-node computer cluster with 360 CPU cores in total, Halvade processes the NA12878 dataset (human, 100 bp paired-end reads, 50× coverage) in <3 h with very high parallel efficiency. Even on a single, multi-core machine, Halvade attains a significant speedup compared with running the individual tools with multithreading.
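
NOTE ON PARALLEL EFFICIENCY: The term is conventionally defined as the speedup over a baseline run divided by the number of workers used. The sketch below only illustrates that arithmetic and is not a calculation taken from the paper. The function names and the single-core baseline of 864 h are hypothetical, chosen so the numbers are easy to follow. Only the 360 cores and the 3 h upper bound come from the abstract.

```python
def speedup(t_baseline_h: float, t_parallel_h: float) -> float:
    """Ratio of baseline runtime to parallel runtime."""
    if t_baseline_h <= 0 or t_parallel_h <= 0:
        raise ValueError("runtimes must be positive")
    return t_baseline_h / t_parallel_h


def parallel_efficiency(t_baseline_h: float, t_parallel_h: float, n_workers: int) -> float:
    """Speedup divided by the number of workers (1.0 would be ideal scaling)."""
    if n_workers <= 0:
        raise ValueError("n_workers must be positive")
    return speedup(t_baseline_h, t_parallel_h) / n_workers


# HYPOTHETICAL single-core baseline, not a figure reported in the abstract.
baseline_hours = 864.0
# From the abstract: 360 CPU cores, runtime below 3 h (3.0 used as the bound).
cluster_hours = 3.0
cores = 360

print(speedup(baseline_hours, cluster_hours))                     # 288.0
print(parallel_efficiency(baseline_hours, cluster_hours, cores))  # 0.8
```

Under this assumed baseline, a runtime of 3 h on 360 cores would correspond to 80% efficiency. Because the abstract reports a runtime below 3 h, the same baseline would give a somewhat higher value. The paper's measured runtimes would be needed to obtain the actual figure.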

Genome, Human

Humans

Sequence Analysis, DNA

Software

Journal Article; Research Support, Non-U.S. Gov't
