Biospark: scalable analysis of large numerical datasets from biological simulations and experiments using Hadoop and Spark.

Max Klein, Rati Sharma, Chris H Bohrer, Cameron M Avelis, Elijah Roberts
Author Information
  1. Max Klein: Department of Biophysics, Johns Hopkins University, Baltimore, MD 21218, USA.
  2. Rati Sharma: Department of Biophysics, Johns Hopkins University, Baltimore, MD 21218, USA.
  3. Chris H Bohrer: Department of Biophysics, Johns Hopkins University, Baltimore, MD 21218, USA.
  4. Cameron M Avelis: Department of Biophysics, Johns Hopkins University, Baltimore, MD 21218, USA.
  5. Elijah Roberts: Department of Biophysics, Johns Hopkins University, Baltimore, MD 21218, USA.

Abstract

Data-parallel programming techniques can dramatically decrease the time needed to analyze large datasets. While these methods have provided significant improvements for sequencing-based analyses, other areas of biological informatics have not yet adopted them. Here, we introduce Biospark, a new framework for performing data-parallel analysis on large numerical datasets. Biospark builds upon the open source Hadoop and Spark projects, bringing domain-specific features for biology.
AVAILABILITY AND IMPLEMENTATION: Source code is licensed under the Apache 2.0 open source license and is available at the project website: https://www.assembla.com/spaces/roberts-lab-public/wiki/Biospark
CONTACT: eroberts@jhu.edu
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Grants

  1. T32 GM008403/NIGMS NIH HHS

MeSH Term

Computational Biology
Computer Simulation
Microscopy
Software
