SpLitteR: diploid genome assembly using TELL-Seq linked-reads and assembly graphs.

Ivan Tolstoganov, Zhoutao Chen, Pavel Pevzner, Anton Korobeynikov
Author Information
  1. Ivan Tolstoganov: Department of Mathematics, Science for Life Laboratory, Stockholm University, Stockholm, Sweden.
  2. Zhoutao Chen: Universal Sequencing Technology Corporation, Carlsbad, California, United States.
  3. Pavel Pevzner: Department of Computer Science and Engineering, University of California, San Diego, San Diego, California, United States.
  4. Anton Korobeynikov: Department of Statistical Modelling, Saint Petersburg State University, Saint Petersburg, Russia.

Abstract

Background: Recent advances in long-read sequencing technologies have enabled accurate and contiguous assemblies of large genomes and metagenomes. However, even long and accurate high-fidelity (HiFi) reads cannot resolve repeats that are longer than the reads themselves. This limitation reduces the contiguity of diploid genome assemblies, since the two haplomes share many long identical regions. To generate telomere-to-telomere assemblies of diploid genomes, biologists now construct HiFi-based phased assemblies and use additional experimental technologies to transform them into more contiguous diploid assemblies. Barcoded linked-reads, generated using the inexpensive TELL-Seq technology, provide an attractive way to bridge unresolved repeats in phased assemblies of diploid genomes.
Results: We developed the SpLitteR tool for diploid genome assembly using linked-reads and assembly graphs and benchmarked it against the state-of-the-art linked-read scaffolders ARKS and SLR-superscaffolder on the human HG002 genome and sheep gut microbiome datasets. The benchmark showed that SpLitteR scaffolding results in a 1.5-fold increase in NGA50 compared to the baseline LJA assembly and other scaffolders while introducing no additional misassemblies on the human dataset.
Conclusion: We developed the SpLitteR tool for assembly graph phasing and scaffolding using barcoded linked-reads. We benchmarked SpLitteR on assembly graphs produced by various long-read assemblers and demonstrated that TELL-Seq reads facilitate phasing and scaffolding in these graphs. This benchmarking demonstrates that SpLitteR improves upon the state-of-the-art linked-read scaffolders in accuracy and contiguity metrics. SpLitteR is implemented in C++ as a part of the freely available SPAdes package and is available at https://github.com/ablab/spades/releases/tag/splitter-preprint.
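
Note on the NGA50 metric: the Results report contiguity as NGA50, which is commonly defined as the largest length L such that aligned blocks of length at least L cover at least half of the reference genome. The following is a minimal sketch of that definition, not the evaluation pipeline used in the study. The function name `nga50` and the block lengths are hypothetical toy values chosen for illustration only; in practice the aligned blocks come from aligning the assembly to a reference and breaking contigs at misassemblies.

```python
def nga50(aligned_block_lengths, reference_length):
    """Return an NGA50-style statistic for a list of aligned block lengths.

    Blocks are visited from longest to shortest; the result is the length
    of the block at which cumulative coverage first reaches half of the
    reference length. Returns 0 if the blocks never reach that threshold.
    """
    target = reference_length / 2
    covered = 0
    for length in sorted(aligned_block_lengths, reverse=True):
        covered += length
        if covered >= target:
            return length
    return 0


# Toy values (assumptions, not data from the paper): joining two blocks
# into one longer block raises the statistic.
baseline_blocks = [400, 300, 200, 100]
scaffolded_blocks = [700, 200, 100]
reference_length = 1000

print(nga50(baseline_blocks, reference_length))
print(nga50(scaffolded_blocks, reference_length))
```

Because the statistic depends only on the longest blocks needed to reach half the reference, correctly joining large blocks raises it, while a misjoin that splits a block at evaluation time lowers it.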

Keywords

Assembly
Repeat resolution
Tell-seq

MeSH Terms

Diploidy
Animals
Humans
Genome, Human
Sheep
Software
Sequence Analysis, DNA
Gastrointestinal Microbiome
High-Throughput Nucleotide Sequencing
Genome
