Experiences with workflows for automating data-intensive bioinformatics.

Ola Spjuth, Erik Bongcam-Rudloff, Guillermo Carrasco Hernández, Lukas Forer, Mario Giovacchini, Roman Valls Guimera, Aleksi Kallio, Eija Korpelainen, Maciej M Kańduła, Milko Krachunov, David P Kreil, Ognyan Kulev, Paweł P Łabaj, Samuel Lampa, Luca Pireddu, Sebastian Schönherr, Alexey Siretskiy, Dimitar Vassilev
Author Information
  1. Ola Spjuth: Department of Pharmaceutical Biosciences and Science for Life Laboratory, Uppsala University, SE-75124, Uppsala, P.O. Box 591, Sweden. ola.spjuth@farmbio.uu.se.
  2. Erik Bongcam-Rudloff: SLU-Global Bioinformatics Centre, Department of Animal Breeding and Genetics, Swedish University of Agricultural Sciences, Uppsala, Sweden. Erik.Bongcam@slu.se.
  3. Guillermo Carrasco Hernández: Science for Life Laboratory, Karolinska Institutet, SE-17121, Stockholm, P.O. Box 1031, Sweden. guillermo.carrasco@scilifelab.se.
  4. Lukas Forer: Division of Genetic Epidemiology, Medical University of Innsbruck, Innsbruck, 6020, Austria. lukas.forer@i-med.ac.at.
  5. Mario Giovacchini: Science for Life Laboratory, Karolinska Institutet, SE-17121, Stockholm, P.O. Box 1031, Sweden. mario.giovacchini@scilifelab.se.
  6. Roman Valls Guimera: Science for Life Laboratory, Karolinska Institutet, SE-17121, Stockholm, P.O. Box 1031, Sweden. brainstorm@nopcode.org.
  7. Aleksi Kallio: CSC - IT Center for Science Ltd., FI-02101, Espoo, P.O. Box 405, Finland. aleksi.kallio@csc.fi.
  8. Eija Korpelainen: CSC - IT Center for Science Ltd., FI-02101, Espoo, P.O. Box 405, Finland. eija.korpelainen@csc.fi.
  9. Maciej M Kańduła: Chair of Bioinformatics Research Group, Boku University, Vienna, Austria. maciej.kandula@boku.ac.at.
  10. Milko Krachunov: Faculty of Mathematics and Informatics, Sofia University, Sofia, Bulgaria. wfxp@milko.3mhz.net.
  11. David P Kreil: Chair of Bioinformatics Research Group, Boku University, Vienna, Austria. david.kreil@boku.ac.at.
  12. Ognyan Kulev: Faculty of Mathematics and Informatics, Sofia University, Sofia, Bulgaria. okulev@fmi.uni-sofia.bg.
  13. Paweł P Łabaj: Chair of Bioinformatics Research Group, Boku University, Vienna, Austria. pawel.labaj@boku.ac.at.
  14. Samuel Lampa: Department of Pharmaceutical Biosciences and Science for Life Laboratory, Uppsala University, SE-75124, Uppsala, P.O. Box 591, Sweden. samuel.lampa@it.uu.se.
  15. Luca Pireddu: CRS4 Polaris, Pula, Italy. luca.pireddu@crs4.it.
  16. Sebastian Schönherr: Division of Genetic Epidemiology, Medical University of Innsbruck, Innsbruck, 6020, Austria. sebastian.schoenherr@i-med.ac.at.
  17. Alexey Siretskiy: Department of Information Technology, Uppsala University, SE-75105, Uppsala, P.O. Box 337, Sweden. alexey.siretskiy@it.uu.se.
  18. Dimitar Vassilev: AgroBioInstitute and Joint Genomic Centre, Sofia, Bulgaria. jim6329@gmail.com.

Abstract

High-throughput technologies, such as next-generation sequencing, have turned molecular biology into a data-intensive discipline, requiring bioinformaticians to use high-performance computing resources and to carry out data management and analysis tasks at large scale. Workflow systems can simplify the construction of analysis pipelines that automate tasks, support reproducibility and provide measures for fault tolerance. However, workflow systems can incur significant development and administration overhead, so bioinformatics pipelines are often still built without them. We present our experiences with workflows and workflow systems within the bioinformatics community participating in a series of hackathons and workshops of the EU COST action SeqAhead. Our organizations are working on similar problems but have addressed them with different strategies and solutions. This fragmentation of efforts is inefficient and leads to redundant and incompatible solutions. Based on our experiences, we define a set of recommendations for future systems to enable efficient yet simple bioinformatics workflow construction and execution.

MeSH Terms

Computational Biology
Electronic Data Processing
High-Throughput Nucleotide Sequencing
Reproducibility of Results
Workflow

