Time- and memory-efficient genome assembly with Raven.

Robert Vaser, Mile Šikić
Author Information
  1. Robert Vaser: Laboratory for Bioinformatics and Computational Biology, Faculty of Electrical Engineering and Computing, University of Zagreb, Zagreb, Croatia.
  2. Mile Šikić: Laboratory for Bioinformatics and Computational Biology, Faculty of Electrical Engineering and Computing, University of Zagreb, Zagreb, Croatia. mile_sikic@gis.a-star.edu.sg.

Abstract

Whole genome sequencing technologies cannot reliably read DNA molecules intact, a shortcoming that assemblers try to resolve by stitching the obtained fragments back together. Here, we present methods for improving de novo genome assembly from erroneous long reads, incorporated into a tool called Raven. Raven maintains similar performance across various genomes and has accuracy on par with other assemblers that support third-generation sequencing data. It is one of the fastest options while having the lowest memory consumption on the majority of benchmarked datasets.

Grants

  1. IP-2018-01-5886/Hrvatska Zaklada za Znanost (Croatian Science Foundation)
  2. KK.01.1.1.01.0009/EC | European Regional Development Fund (Europski Fond za Regionalni Razvoj)
