An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics.

Ronald C Taylor

Author Information

Ronald C Taylor: Computational Biology and Bioinformatics Group, Pacific Northwest National Laboratory, Richland, Washington 99352, USA. ronald.taylor@pnl.gov

PMID: 21210976 DOI: 10.1186/1471-2105-11-S12-S1

BACKGROUND: Bioinformatics researchers now face the analysis of ultra-large-scale data sets, a problem that will only grow in the coming years. Recent developments in open source software, namely the Hadoop project and associated software, provide a foundation for scaling to petabyte-scale data warehouses on Linux clusters and for fault-tolerant, parallelized analysis of such data using a programming style named MapReduce.
DESCRIPTION: An overview is given of the current usage within the bioinformatics community of Hadoop, a top-level Apache Software Foundation project, and of associated open source software projects. The concepts behind Hadoop and the associated HBase project are defined, and current bioinformatics software that employs Hadoop is described. The focus is on next-generation sequencing, the leading application area to date.
CONCLUSIONS: Hadoop and the MapReduce programming paradigm already have a substantial base in the bioinformatics community, especially in the field of next-generation sequencing analysis, and such use is increasing. This is due to the cost-effectiveness of Hadoop-based analysis on commodity Linux clusters and in the cloud, via data upload to cloud vendors who have implemented Hadoop/HBase, and to the effectiveness and ease of use of the MapReduce method in parallelizing many data analysis algorithms.
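
ILLUSTRATION: The MapReduce style named above can be sketched in a few lines. The example below is a minimal, single-process sketch in plain Python, not Hadoop code and not taken from the article: it counts k-mers in sequencing reads by emitting key/value pairs (map), grouping them by key (shuffle), and summing per key (reduce). The function names, the sample reads, and the choice of k = 3 are all assumptions made for illustration; on a real cluster the framework would run the map and reduce steps in parallel across nodes.

```python
from collections import defaultdict
from itertools import chain


def map_read(read, k=3):
    """Map step: emit a (k-mer, 1) pair for every k-mer in one read."""
    return [(read[i:i + k], 1) for i in range(len(read) - k + 1)]


def shuffle(pairs):
    """Shuffle step: group emitted values by key (done by the framework in Hadoop)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce step: sum the counts collected for one k-mer."""
    return key, sum(values)


def count_kmers(reads, k=3):
    """Run map, shuffle, and reduce in sequence over a list of reads."""
    mapped = chain.from_iterable(map_read(r, k) for r in reads)
    grouped = shuffle(mapped)
    return dict(reduce_counts(key, vals) for key, vals in grouped.items())


# Hypothetical toy input; real reads would come from a sequencing run.
reads = ["ACGTAC", "GTACGT"]
counts = count_kmers(reads)
```

Because each read is mapped independently and each k-mer is reduced independently, both steps can in principle be distributed without changing the per-record logic, which is the property the conclusions attribute to the MapReduce method.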

Algorithms

Cluster Analysis

Computational Biology

High-Throughput Nucleotide Sequencing

Software

Journal Article; Research Support, U.S. Gov't, Non-P.H.S.

Cited by

IGBT Fault Prediction Combining Terminal Characteristics and Artificial Intelligence Neural Network.
Efficient Streaming Mass Spatio-Temporal Vehicle Data Access in Urban Sensor Networks Based on Apache Storm.
Theoretical and Empirical Comparison of Big Data Image Processing with Apache Hadoop and Sun Grid Engine.
A Data Colocation Grid Framework for Big Data Medical Image Processing: Backend Design.
Hadoop-BAM: directly manipulating next generation sequencing data in the cloud.
An Efficient Middle Layer Platform for Medical Imaging Archives.
Predicting the severity of motor neuron disease progression using electronic health record data with a cloud computing Big Data approach.
Medical Big Data Warehouse: Architecture and System Design, a Case Study: Improving Healthcare Resources Distribution.
Omics AnalySIs System for PRecision Oncology (OASISPRO): a web-based omics analysis tool for clinical phenotype prediction.
SeqPig: simple and scalable scripting for large sequencing data sets in Hadoop.
