SeqPig: simple and scalable scripting for large sequencing data sets in Hadoop.

André Schumacher, Luca Pireddu, Matti Niemenmaa, Aleksi Kallio, Eija Korpelainen, Gianluigi Zanetti, Keijo Heljanko

Author Information

André Schumacher: Aalto University School of Science and Helsinki Institute for Information Technology HIIT, Finland, International Computer Science Institute, Berkeley, CA, USA, CRS4-Center for Advanced Studies, Research and Development in Sardinia, Italy and CSC-IT Center for Science, Finland.

PMID: 24149054 DOI: 10.1093/bioinformatics/btt601

SUMMARY: Hadoop MapReduce-based approaches have become increasingly popular due to their scalability in processing large sequencing datasets. However, as these methods typically require in-depth expertise in Hadoop and Java, they are still out of reach of many bioinformaticians. To solve this problem, we have created SeqPig, a library and a collection of tools to manipulate, analyze and query sequencing datasets in a scalable and simple manner. SeqPig scripts use the Hadoop-based distributed scripting engine Apache Pig, which automatically parallelizes and distributes data processing tasks. We demonstrate SeqPig's scalability over many computing nodes and illustrate its use with example scripts.
AVAILABILITY AND IMPLEMENTATION: Available under the open source MIT license at http://sourceforge.net/projects/seqpig/
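
As a rough illustration of what "automatically parallelizes and distributes data processing tasks" can mean for a grouping query, the sketch below computes a per-position base count in plain Python, split into a map phase over chunks and a reduce phase that merges partial results. It is not Pig Latin and does not use the SeqPig API; the toy reads and all function names (`partition`, `map_chunk`, `merge`, `base_histogram`) are hypothetical and chosen only for this example.

```python
from collections import Counter
from functools import reduce
from typing import Iterable, List

# Hypothetical toy reads; real input would be sequencing files stored on a cluster.
READS = ["ACGT", "ACGA", "TCGA", "ACTT"]


def partition(items: List[str], n_parts: int) -> List[List[str]]:
    """Split the input into n_parts chunks, one per simulated worker."""
    return [items[i::n_parts] for i in range(n_parts)]


def map_chunk(chunk: Iterable[str]) -> Counter:
    """Map phase: count (position, base) pairs within a single chunk."""
    counts: Counter = Counter()
    for read in chunk:
        for pos, base in enumerate(read):
            counts[(pos, base)] += 1
    return counts


def merge(a: Counter, b: Counter) -> Counter:
    """Reduce phase: combine the partial counts of two chunks."""
    return a + b


def base_histogram(reads: List[str], n_parts: int = 2) -> Counter:
    """Group-and-count over all reads, computed chunk by chunk."""
    # Each chunk is independent, so a framework could process them on different nodes.
    partials = [map_chunk(chunk) for chunk in partition(reads, n_parts)]
    return reduce(merge, partials, Counter())


if __name__ == "__main__":
    hist = base_histogram(READS)
    for (pos, base), n in sorted(hist.items()):
        print(pos, base, n)
```

The result does not depend on the number of chunks, which is the property that lets a scripting engine choose the degree of parallelism without the script author having to specify it.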

095931/Wellcome Trust

High-Throughput Screening Assays

Software Design

Journal Article Research Support, Non-U.S. Gov't

CloudDOE: a user-friendly tool for deploying Hadoop clouds and analyzing high-throughput sequencing data with MapReduce.
A quantitative assessment of the Hadoop framework for analyzing massively parallel DNA sequencing data.
Scalable metagenomics alignment research tool (SMART): a scalable, rapid, and complete search heuristic for the classification of metagenomic sequences from complex sequence populations.
Benchmarking distributed data warehouse solutions for storing genomic variant information.
SeqHBase: a big data toolset for family based sequencing data analysis.
HAlign-II: efficient ultra-large multiple sequence alignment and phylogenetic tree reconstruction with distributed and parallel computing.
A Large-Scale and Serverless Computational Approach for Improving Quality of NGS Data Supporting Big Multi-Omics Data Analyses.
Distributed hybrid-indexing of compressed pan-genomes for scalable and fast sequence alignment.
START: a system for flexible analysis of hundreds of genomic signal tracks in few lines of SQL-like queries.
Experiences with workflows for automating data-intensive bioinformatics.
