Hadoop-BAM: directly manipulating next generation sequencing data in the cloud.

Matti Niemenmaa, Aleksi Kallio, André Schumacher, Petri Klemelä, Eija Korpelainen, Keijo Heljanko
Author Information
  1. Matti Niemenmaa: Aalto University, Department of Information and Computer Science, Aalto, Finland. matti.niemenmaa@aalto.fi

Abstract

Hadoop-BAM is a novel library for the scalable manipulation of aligned next-generation sequencing data in the Hadoop distributed computing framework. It acts as an integration layer between analysis applications and BAM files that are processed using Hadoop. Hadoop-BAM solves the issues related to BAM data access by presenting a convenient API for implementing map and reduce functions that can operate directly on BAM records. It builds on top of the Picard SAM JDK, so tools that rely on the Picard API are expected to be easily convertible to support large-scale distributed processing. In this article we demonstrate the use of Hadoop-BAM by building a coverage summarizing tool for the Chipster genome browser. Our results show that Hadoop offers good scalability, and that moving data in and out of Hadoop between analysis steps should be avoided.
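
The map and reduce structure of a coverage summary can be illustrated with a minimal sketch in plain Python. This is not the authors' implementation and does not use the Hadoop-BAM or Picard APIs. The `Read` tuple, the bin size, and the function names `map_read`, `reduce_bin`, and `summarize_coverage` are all assumptions made for illustration, and the grouping step that Hadoop would perform in its shuffle phase is simulated in memory.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical stand-in for an aligned BAM record:
# (reference name, 1-based start position, aligned length in bases).
Read = Tuple[str, int, int]
BinKey = Tuple[str, int]

BIN_SIZE = 100  # assumed summary resolution in bases, chosen for illustration


def map_read(read: Read, bin_size: int = BIN_SIZE) -> Iterator[Tuple[BinKey, int]]:
    """Map step: emit ((reference, bin index), bases covered in that bin)."""
    ref, start, length = read
    pos = start - 1          # convert to a 0-based position
    end = pos + length       # exclusive end of the alignment
    while pos < end:
        bin_index = pos // bin_size
        bin_end = (bin_index + 1) * bin_size
        covered = min(end, bin_end) - pos
        yield (ref, bin_index), covered
        pos += covered


def reduce_bin(values: Iterable[int], bin_size: int = BIN_SIZE) -> float:
    """Reduce step: mean coverage depth over one bin."""
    return sum(values) / bin_size


def summarize_coverage(reads: Iterable[Read], bin_size: int = BIN_SIZE) -> Dict[BinKey, float]:
    """Run map, group by key (the in-memory stand-in for the shuffle), then reduce."""
    grouped: Dict[BinKey, List[int]] = defaultdict(list)
    for read in reads:
        for key, covered in map_read(read, bin_size):
            grouped[key].append(covered)
    return {key: reduce_bin(vals, bin_size) for key, vals in grouped.items()}


if __name__ == "__main__":
    example_reads = [("chr1", 1, 100), ("chr1", 51, 100)]
    for key, depth in sorted(summarize_coverage(example_reads).items()):
        print(key, depth)
```

Because each read contributes to its bins independently and the per-bin sums are associative, the same two functions could in principle run across many input splits, which is the property a distributed framework relies on.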

MeSH Terms

Genome
High-Throughput Nucleotide Sequencing
Sequence Analysis, DNA
Software
User-Computer Interface
