HSRA: Hadoop-based spliced read aligner for RNA sequencing data.

Roberto R Expósito, Jorge González-Domínguez, Juan Touriño
Author Information
  1. Roberto R Expósito: Computer Architecture Group, Universidade da Coruña, Campus de Elviña, 15071 A Coruña, Spain.
  2. Jorge González-Domínguez: Computer Architecture Group, Universidade da Coruña, Campus de Elviña, 15071 A Coruña, Spain.
  3. Juan Touriño: Computer Architecture Group, Universidade da Coruña, Campus de Elviña, 15071 A Coruña, Spain.

Abstract

Transcriptome sequencing (RNA-seq) has become the standard method for quantifying gene expression levels. In RNA-seq experiments, mapping short reads to a reference genome or transcriptome is a crucial step and remains one of the most time-consuming. With the steady development of Next Generation Sequencing (NGS) technologies, unprecedented amounts of genomic data pose significant challenges for storage, processing and downstream analysis. As sequencing cost and throughput continue to improve, there is a growing need for software that minimizes the impact of increasing data volume on RNA read alignment. In this work we introduce HSRA, a Big Data tool that uses the MapReduce programming model to extend the multithreading capabilities of a state-of-the-art spliced read aligner for RNA-seq data (HISAT2) to distributed-memory systems such as multi-core clusters or cloud platforms. HSRA is built upon the Hadoop MapReduce framework and supports both single- and paired-end reads from FASTQ/FASTA datasets, providing output alignments in SAM format. Its design has been carefully optimized to avoid the main limitations and major causes of inefficiency found in previous Big Data mapping tools, which cannot fully exploit the raw performance of the underlying aligner. On a 16-node multi-core cluster, HSRA is on average 2.3 times faster than previous Hadoop-based tools. The Java source code and a user's guide are publicly available for download at http://hsra.dec.udc.es.
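
The general idea described in the abstract, splitting a read dataset into independent pieces and running a multithreaded HISAT2 process on each piece, can be illustrated with a minimal Java sketch. This is not HSRA's actual implementation: the class name FastqSplitSketch, both helper methods, and the file names are hypothetical, the reads are held in memory rather than in HDFS input splits, and every FASTQ record is assumed to span exactly four lines. The only HISAT2 options used are -p (threads), -x (index basename), -U (unpaired reads) and -S (SAM output).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not taken from the HSRA source code.
final class FastqSplitSketch {

    // Assumption: each FASTQ record is exactly four lines
    // (header, bases, '+' separator, qualities).
    static final int LINES_PER_RECORD = 4;

    /** Groups FASTQ lines into chunks holding at most recordsPerChunk whole records. */
    static List<List<String>> splitRecords(List<String> lines, int recordsPerChunk) {
        if (recordsPerChunk <= 0) {
            throw new IllegalArgumentException("recordsPerChunk must be positive");
        }
        if (lines.size() % LINES_PER_RECORD != 0) {
            throw new IllegalArgumentException("input ends with a truncated FASTQ record");
        }
        int linesPerChunk = recordsPerChunk * LINES_PER_RECORD;
        List<List<String>> chunks = new ArrayList<>();
        // Chunk boundaries always fall on record boundaries, so no read is cut in half.
        for (int start = 0; start < lines.size(); start += linesPerChunk) {
            int end = Math.min(start + linesPerChunk, lines.size());
            chunks.add(new ArrayList<>(lines.subList(start, end)));
        }
        return chunks;
    }

    /** Builds the HISAT2 command line that would align one single-end chunk. */
    static List<String> hisat2Command(String indexBase, String chunkFastq,
                                      String outSam, int threads) {
        return Arrays.asList("hisat2", "-p", String.valueOf(threads),
                "-x", indexBase, "-U", chunkFastq, "-S", outSam);
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "@r1", "ACGT", "+", "IIII",
                "@r2", "TTGA", "+", "IIII",
                "@r3", "GGCC", "+", "IIII");
        List<List<String>> chunks = splitRecords(lines, 2);
        for (int i = 0; i < chunks.size(); i++) {
            int records = chunks.get(i).size() / LINES_PER_RECORD;
            System.out.println("chunk " + i + ": " + records + " record(s)");
            // Placeholder index and file names; the command is printed, not executed.
            System.out.println(String.join(" ",
                    hisat2Command("genome_index", "chunk" + i + ".fq", "chunk" + i + ".sam", 4)));
        }
    }
}
```

In a real MapReduce job, each map task would receive its chunk from the framework and the per-chunk SAM outputs would be merged afterwards; the sketch only shows why chunk boundaries must coincide with record boundaries and how one aligner invocation per chunk might be assembled.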

MeSH Terms

Big Data
High-Throughput Nucleotide Sequencing
RNA Folding
Sequence Alignment
Sequence Analysis, RNA
Software