BAMSI: a multi-cloud service for scalable distributed filtering of massive genome data.

Kristiina Ausmees, Aji John, Salman Z Toor, Andreas Hellander, Carl Nettelblad

Author Information

Kristiina Ausmees: Department of Information Technology, Uppsala University, Box 377, Uppsala, Sweden.
Aji John: Department of Biology, University of Washington, Box 351800, Seattle, 98195, USA.
Salman Z Toor: Department of Information Technology, Uppsala University, Box 377, Uppsala, Sweden.
Andreas Hellander: Department of Information Technology, Uppsala University, Box 377, Uppsala, Sweden.
Carl Nettelblad: Department of Information Technology, Uppsala University, Box 377, Uppsala, Sweden. carl.nettelblad@it.uu.se.

PMID: 29940842 DOI: 10.1186/s12859-018-2241-z

BACKGROUND: The advent of next-generation sequencing (NGS) has made whole-genome sequencing of cohorts of individuals a reality. Primary datasets of raw or aligned reads of this sort can get very large. For scientific questions where curated called variants are not sufficient, the sheer size of the datasets makes analysis prohibitively expensive. To make re-analysis of such data feasible without access to a large-scale computing facility, we have developed a highly scalable, storage-agnostic framework, an associated API and an easy-to-use web user interface to execute custom filters on large genomic datasets.
RESULTS: We present BAMSI, a Software-as-a-Service (SaaS) solution for filtering of the 1000 Genomes phase 3 set of aligned reads, with the possibility of extension and customization to other sets of files. Unique to our solution is the capability of simultaneously utilizing many different mirrors of the data to increase the speed of the analysis. In particular, if the data is available in private or public clouds (an increasingly common scenario for both academic and commercial cloud providers), our framework allows for seamless deployment of filtering workers close to the data. We show results indicating that such a setup improves the horizontal scalability of the system, and present a possible use case of the framework by performing an analysis of structural variation in the 1000 Genomes data set.
CONCLUSIONS: BAMSI constitutes a framework for efficient filtering of large genomic data sets that is flexible in the use of compute as well as storage resources. The data resulting from the filter is assumed to be greatly reduced in size, and can easily be downloaded or routed into, for example, a Hadoop cluster for subsequent interactive analysis using Hive, Spark or similar tools. In this respect, our framework also suggests a general model for making very large datasets of high scientific value more accessible by offering the possibility for organizations to share the cost of hosting data on hot storage, without compromising the scalability of downstream analysis.
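To make concrete why filtered output can be much smaller than the input, here is a minimal sketch of a read filter of the kind that might feed a structural variation analysis. It is an illustration under stated assumptions, not BAMSI's filter API: it works on text SAM records rather than BAM files to avoid external dependencies, and the function names and the thresholds `min_mapq` and `max_insert` are hypothetical. The column positions and flag bits follow the SAM format.

```python
def is_candidate_read(sam_line, min_mapq=20, max_insert=1000):
    """Return True for a mapped, paired read with an unusually long template.

    min_mapq and max_insert are hypothetical thresholds for illustration.
    """
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])        # column 2: bitwise FLAG
    mapq = int(fields[4])        # column 5: mapping quality
    tlen = abs(int(fields[8]))   # column 9: observed template length
    paired = bool(flag & 0x1)    # 0x1: template has multiple segments
    unmapped = bool(flag & 0x4)  # 0x4: segment unmapped
    return paired and not unmapped and mapq >= min_mapq and tlen > max_insert


def filter_sam(lines, **thresholds):
    """Yield alignment records that pass the filter, skipping header lines."""
    for line in lines:
        if line.startswith("@"):
            continue  # SAM header lines begin with '@'
        if is_candidate_read(line, **thresholds):
            yield line


# Hypothetical records: only r2 is paired, mapped, well aligned and long.
example_records = [
    "@HD\tVN:1.6",
    "r1\t99\tchr1\t100\t60\t50M\t=\t300\t250\t*\t*",
    "r2\t97\tchr1\t100\t60\t50M\t=\t5100\t5050\t*\t*",
    "r3\t97\tchr1\t100\t5\t50M\t=\t5100\t5050\t*\t*",
    "r4\t77\t*\t0\t0\t*\t*\t0\t0\t*\t*",
]
kept = list(filter_sam(example_records))
```

Because the filter is a generator, records are processed one at a time, so memory use stays flat regardless of input size.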

Keywords: 1000 genomes; Big data; Cloud computing; Human genome; Next-generation sequencing

MeSH terms: Cloud Computing; Genomics; High-Throughput Nucleotide Sequencing; Humans

Publication types: Journal Article; Research Support, Non-U.S. Gov't
