Biospark: scalable analysis of large numerical datasets from biological simulations and experiments using Hadoop and Spark.


Max Klein, Rati Sharma, Chris H Bohrer, Cameron M Avelis, Elijah Roberts

Author Information

Max Klein: Department of Biophysics, Johns Hopkins University, Baltimore, MD 21218, USA.
Rati Sharma: Department of Biophysics, Johns Hopkins University, Baltimore, MD 21218, USA.
Chris H Bohrer: Department of Biophysics, Johns Hopkins University, Baltimore, MD 21218, USA.
Cameron M Avelis: Department of Biophysics, Johns Hopkins University, Baltimore, MD 21218, USA.
Elijah Roberts: Department of Biophysics, Johns Hopkins University, Baltimore, MD 21218, USA.

PMID: 27663493 DOI: 10.1093/bioinformatics/btw614

Data-parallel programming techniques can dramatically decrease the time needed to analyze large datasets. While these methods have provided significant improvements for sequencing-based analyses, other areas of biological informatics have not yet adopted them. Here, we introduce Biospark, a new framework for performing data-parallel analysis on large numerical datasets. Biospark builds upon the open source Hadoop and Spark projects, bringing domain-specific features for biology.
AVAILABILITY AND IMPLEMENTATION: Source code is licensed under the Apache 2.0 open source license and is available at the project website: https://www.assembla.com/spaces/roberts-lab-public/wiki/Biospark
CONTACT: eroberts@jhu.edu
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
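
To make the data-parallel idea in the abstract concrete, the sketch below computes the mean and variance of a numerical dataset by splitting it into partitions, summarizing each partition independently (the map step), and merging the partial summaries (the reduce step). It is an illustration only: it uses the Python standard library rather than Spark, Hadoop, or Biospark, and the function names (partition, summarize, combine, parallel_mean_variance) are hypothetical and not taken from the Biospark source. The input is assumed to be a non-empty list of numbers.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def partition(values, n_parts):
    """Split a list into n_parts contiguous chunks of near-equal size."""
    size, extra = divmod(len(values), n_parts)
    chunks, start = [], 0
    for i in range(n_parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(values[start:end])
        start = end
    return chunks


def summarize(chunk):
    """Map step: reduce one chunk to (count, sum, sum of squares)."""
    return (len(chunk), sum(chunk), sum(x * x for x in chunk))


def combine(a, b):
    """Reduce step: merge two partial summaries element-wise."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])


def parallel_mean_variance(values, n_parts=4):
    """Mean and population variance of a non-empty list (assumption)."""
    chunks = partition(values, n_parts)
    # Each chunk is summarized independently, so the work can be spread
    # across workers; threads are used here only to show the structure.
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        partials = list(pool.map(summarize, chunks))
    count, total, total_sq = reduce(combine, partials, (0, 0.0, 0.0))
    mean = total / count
    variance = total_sq / count - mean * mean
    return mean, variance
```

Because each partial summary depends only on its own partition, the same map and reduce functions could in principle be handed to a cluster framework. In CPython the thread pool will not speed up this CPU-bound work; it stands in for the distributed workers that a framework such as Spark would provide.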

T32 GM008403/NIGMS NIH HHS

Computational Biology

Computer Simulation

Microscopy

Software

Journal Article
