SeqPig: simple and scalable scripting for large sequencing data sets in Hadoop.

André Schumacher, Luca Pireddu, Matti Niemenmaa, Aleksi Kallio, Eija Korpelainen, Gianluigi Zanetti, Keijo Heljanko
Author Information
  1. André Schumacher: Aalto University School of Science and Helsinki Institute for Information Technology HIIT, Finland, International Computer Science Institute, Berkeley, CA, USA, CRS4-Center for Advanced Studies, Research and Development in Sardinia, Italy and CSC-IT Center for Science, Finland.

Abstract

SUMMARY: Hadoop MapReduce-based approaches have become increasingly popular due to their scalability in processing large sequencing datasets. However, as these methods typically require in-depth expertise in Hadoop and Java, they are still out of reach of many bioinformaticians. To solve this problem, we have created SeqPig, a library and a collection of tools to manipulate, analyze and query sequencing datasets in a scalable and simple manner. SeqPig scripts use the Hadoop-based distributed scripting engine Apache Pig, which automatically parallelizes and distributes data processing tasks. We demonstrate SeqPig's scalability over many computing nodes and illustrate its use with example scripts.
AVAILABILITY AND IMPLEMENTATION: Available under the open source MIT license at http://sourceforge.net/projects/seqpig/
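
The summary's point that the scripting engine parallelizes and distributes work automatically can be made concrete with a small sketch of the map-and-reduce pattern that such engines spread across nodes. This is plain single-process Python, not SeqPig or Apache Pig code; the function names and the three reads are hypothetical and chosen only for illustration.

```python
from collections import Counter
from functools import reduce


def map_read(read):
    """Map step: count the bases seen in a single read."""
    return Counter(read)


def merge_counts(left, right):
    """Reduce step: combine two partial base counts into one."""
    return left + right


def base_counts(reads):
    """Apply the map step to every read, then fold the partial counts."""
    return reduce(merge_counts, map(map_read, reads), Counter())


# Made-up reads, for illustration only.
reads = ["ACGT", "AACG", "TTGA"]
print(dict(sorted(base_counts(reads).items())))
```

Because each read is mapped independently and the merge step is associative, a distributed engine is free to run the map step on different nodes and combine the partial counts in any order.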

Grants

  1. 095931/Wellcome Trust

MeSH Term

High-Throughput Screening Assays
Software Design
