Shared data science infrastructure for genomics data.

Hamid Bagheri, Usha Muppirala, Rick E Masonbrink, Andrew J Severin, Hridesh Rajan

Author Information

Hamid Bagheri: Department of Computer Science, Iowa State University, 226 Atanasoff Hall, Ames, 50011, USA. hbagheri@iastate.edu.
Usha Muppirala: Genome Informatics Facility, Iowa State University, 206 Science I, Ames, 50011, USA.
Rick E Masonbrink: Genome Informatics Facility, Iowa State University, 206 Science I, Ames, 50011, USA.
Andrew J Severin: Genome Informatics Facility, Iowa State University, 206 Science I, Ames, 50011, USA.
Hridesh Rajan: Department of Computer Science, Iowa State University, 226 Atanasoff Hall, Ames, 50011, USA.

PMID: 31438850 DOI: 10.1186/s12859-019-2967-2

BACKGROUND: Creating a scalable computational infrastructure to analyze the wealth of information contained in data repositories is difficult due to significant barriers in organizing, extracting and analyzing relevant data. Shared data science infrastructures like Boa are needed to efficiently process and parse data contained in large data repositories. The main features of Boa are inspired by existing languages for data-intensive computing and can easily integrate data from biological data repositories.
RESULTS: As a proof of concept, Boa for genomics, Boag, has been implemented to analyze RefSeq's 153,848 annotation (GFF) and assembly (FASTA) file metadata. Boa provides a massive improvement over existing solutions such as Python and MongoDB by using a domain-specific language built on Hadoop infrastructure, which gives a smaller storage footprint, scales well, and requires fewer lines of code. We execute scripts through Boa to answer questions about the genomes in RefSeq: we identify the largest and smallest genomes deposited, explore exon frequencies for assemblies after 2016, identify the most commonly used bacterial genome assembly program, and address how animal genome assemblies have improved since 2016. Boa databases significantly reduce the storage required for the raw data and significantly speed up queries over large datasets, owing to the automated parallelization and distribution that Hadoop infrastructure provides during computation.
CONCLUSIONS: Innovative methods are required to keep pace with our ability to produce biological data. The Shared Data Science Infrastructure, Boa, gives researchers a greater ability to explore data efficiently and in new ways. We demonstrate the potential of the domain-specific language Boa by using the RefSeq database to explore how deposited genome assemblies and annotations are changing over time. This is a small example of how Boa could be used with large biological datasets.
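
One of the questions listed in the results, finding the largest and smallest deposited genomes, can be illustrated with a minimal sketch. This is plain Python, not Boa syntax, and it runs on a single machine rather than on Hadoop. The `AssemblyRecord` type, its field names, and the sample values are all hypothetical and are not taken from RefSeq or from the Boa schema.

```python
from typing import Iterable, NamedTuple, Optional, Tuple


class AssemblyRecord(NamedTuple):
    """Hypothetical assembly metadata record (field names are assumptions)."""
    accession: str
    total_length: int  # assembly size in base pairs
    year: int          # year the assembly was deposited


def size_extremes(
    records: Iterable[AssemblyRecord],
) -> Optional[Tuple[AssemblyRecord, AssemblyRecord]]:
    """Return (smallest, largest) by total_length in one pass, or None if empty."""
    smallest = largest = None
    for rec in records:
        if smallest is None or rec.total_length < smallest.total_length:
            smallest = rec
        if largest is None or rec.total_length > largest.total_length:
            largest = rec
    if smallest is None:
        return None
    return smallest, largest


# Made-up sample data, for illustration only.
SAMPLE = [
    AssemblyRecord("ASM_A", 4_600_000, 2015),
    AssemblyRecord("ASM_B", 3_100_000_000, 2017),
    AssemblyRecord("ASM_C", 580_000, 2018),
]

if __name__ == "__main__":
    result = size_extremes(SAMPLE)
    if result is not None:
        smallest, largest = result
        print("smallest:", smallest.accession, smallest.total_length)
        print("largest:", largest.accession, largest.total_length)
```

A single-pass minimum and maximum is an aggregation that can be computed per data partition and then merged, which is why queries of this shape suit the automated parallelization the abstract attributes to Hadoop.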

Keywords: Boag; Domain-Specific Language; Genome Annotation; Shared Data Science Infrastructure

(CCF-15-18897)/National Science Foundation (US)
(CNS-15-13263)/National Science Foundation
(Presidential Initiative)/Iowa State University

Animals

Data Science

Databases, Factual

Databases, Genetic

Exons

Genome

Genomics

Information Dissemination

Sequence Analysis, DNA

Software

Journal Article
