Halvade-RNA: Parallel variant calling from transcriptomic data using MapReduce.

Dries Decap, Joke Reumers, Charlotte Herzeel, Pascal Costanza, Jan Fostier
Author Information
  1. Dries Decap: Department of Information Technology, IDLab, Ghent University - imec, Ghent, Belgium.
  2. Joke Reumers: Janssen Research & Development, a division of Janssen Pharmaceutica N.V., Beerse, Belgium.
  3. Charlotte Herzeel: imec, Leuven, Belgium.
  4. Pascal Costanza: Intel Corporation Belgium, Leuven, Belgium.
  5. Jan Fostier: Department of Information Technology, IDLab, Ghent University - imec, Ghent, Belgium.

Abstract

Given the current cost-effectiveness of next-generation sequencing (NGS), the amount of DNA-seq and RNA-seq data generated is ever increasing. One of the primary objectives of NGS experiments is calling genetic variants. While highly accurate, most variant calling pipelines are not optimized to run efficiently on large data sets. However, as variant calling in genomic data has become common practice, several methods have been proposed to reduce runtime for DNA-seq analysis through the use of parallel computing. Determining the effectively expressed variants from transcriptomics (RNA-seq) data has only recently become possible, and as such does not yet benefit from efficiently parallelized workflows. We introduce Halvade-RNA, a parallel, multi-node RNA-seq variant calling pipeline based on the GATK Best Practices recommendations. Halvade-RNA makes use of the MapReduce programming model to create and manage parallel data streams on which multiple instances of existing tools such as STAR and GATK operate concurrently. Whereas the single-threaded processing of a typical RNA-seq sample requires ∼28 h, Halvade-RNA reduces this runtime to ∼2 h using a small cluster with two 20-core machines. Even on a single, multi-core workstation, Halvade-RNA can significantly reduce runtime compared to using multi-threading, thus providing more cost-effective processing of RNA-seq data. Halvade-RNA is written in Java and uses the Hadoop MapReduce 2.0 API. It supports a wide range of Hadoop distributions, including Cloudera and Amazon EMR.
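
The MapReduce programming model mentioned in the abstract can be illustrated with a minimal, framework-free sketch. This is not Halvade-RNA's implementation (which is written in Java against the Hadoop API and runs STAR and GATK); it only shows the general map, group-by-key, reduce pattern. The read tuples, the `REGION_SIZE` bin width, the choice of genomic region as the key, and the read-counting reducer are all hypothetical stand-ins chosen for illustration.

```python
from collections import defaultdict

REGION_SIZE = 1000  # hypothetical bin width in bases, for illustration only


def map_read(read):
    """Map step: emit a (key, value) pair for one aligned read.

    The key is a (chromosome, region index) tuple, so that reads falling
    in the same region end up in the same reduce task.
    """
    chrom, pos, _seq = read
    return (chrom, pos // REGION_SIZE), read


def shuffle(pairs):
    """Group values by key, as a MapReduce framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_region(key, reads):
    """Reduce step: placeholder for per-region processing; here it only counts reads."""
    return key, len(reads)


def run(reads):
    """Run map, shuffle, and reduce sequentially on an in-memory list of reads."""
    groups = shuffle(map_read(r) for r in reads)
    return dict(reduce_region(k, v) for k, v in sorted(groups.items()))


# Toy input: (chromosome, 0-based position, sequence); values are made up.
reads = [
    ("chr1", 150, "ACGT"),
    ("chr1", 990, "GGCA"),
    ("chr1", 1200, "TTAG"),
    ("chr2", 42, "CCGA"),
]
result = run(reads)
```

Because each key is reduced independently, a framework such as Hadoop can distribute the reduce calls across cores or nodes; in this sketch they simply run one after another.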

MeSH Terms

Algorithms
Computational Biology
Genomics
High-Throughput Nucleotide Sequencing
Polymorphism, Single Nucleotide
RNA
Sequence Analysis, DNA
Software
Transcriptome

Chemicals

RNA
