Anchorage Accurately Assembles Anchor-Flanked Synthetic Long Reads.

Xiaofei Carl Zang, Xiang Li, Kyle Metcalfe, Tuval Ben-Yehezkel, Ryan Kelley, Mingfu Shao
Author Information
  1. Xiaofei Carl Zang: Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, PA, USA.
  2. Xiang Li: Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA, USA.
  3. Kyle Metcalfe: Element Biosciences, San Diego, CA, USA.
  4. Tuval Ben-Yehezkel: Element Biosciences, San Diego, CA, USA.
  5. Ryan Kelley: Element Biosciences, San Diego, CA, USA.
  6. Mingfu Shao: Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, PA, USA; Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA, USA.

Abstract

Modern sequencing technologies can attach short sequence tags, known as anchors, to both ends of a captured molecule. Anchors help in assembling the full-length sequence of a captured molecule because they accurately mark its endpoints. One representative anchor-enabled technology is LoopSeq Solo, a synthetic long read (SLR) sequencing protocol, which also achieves ultra-high sequencing depth and high purity of the short reads covering the entire captured molecule. Despite the availability of many assembly methods, constructing full-length sequences from such anchor-enabled, ultra-high-coverage sequencing data remains challenging because the underlying assembly graphs are complex and algorithms that specifically leverage anchors are lacking. We present Anchorage, a novel assembler that performs anchor-guided assembly for ultra-high-depth sequencing data. Anchorage starts with a k-mer-based approach that precisely estimates molecule lengths. It then formulates the assembly problem as finding an optimal path that connects the two nodes determined by the anchors in the underlying compact de Bruijn graph. Optimality is defined as maximizing the weight of the lowest-weight node on the path while matching the estimated sequence length. Anchorage uses a modified dynamic programming algorithm to find the optimal path efficiently. Using both simulated and real data, we show that Anchorage outperforms existing assembly methods, particularly in the presence of sequencing artifacts. Anchorage fills the gap in assembling anchor-enabled data, and we anticipate its broad use as anchor-enabled sequencing technologies become prevalent. Anchorage is freely available at https://github.com/Shao-Group/anchorage; the scripts and documents needed to reproduce all experiments in this manuscript are available at https://github.com/Shao-Group/anchorage-test.
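
The path formulation above can be illustrated with a small sketch. This is not Anchorage's implementation: the function name `best_anchored_path`, the node representation (a weight and a positive length contribution per node), the tolerance parameter `tol`, and the toy graph are all assumptions made for illustration. The sketch runs a dynamic program over (node, accumulated length) states and keeps, for each state, the largest achievable minimum node weight.

```python
from collections import defaultdict

def best_anchored_path(nodes, edges, source, target, est_len, tol=0):
    """Hypothetical sketch of a length-constrained bottleneck path search.

    nodes: {name: (weight, length)}, where length is the (assumed, positive)
           number of bases the node contributes to the assembled sequence.
    edges: {name: [successor names]}.
    Returns (path, bottleneck_weight, total_length) or None.
    """
    assert all(length > 0 for _, length in nodes.values())
    max_len = est_len + tol
    src_weight, src_len = nodes[source]
    if src_len > max_len:
        return None
    # best[(node, length)] = (bottleneck so far, predecessor state)
    best = {(source, src_len): (src_weight, None)}
    by_len = defaultdict(set)
    by_len[src_len].add(source)
    # Node lengths are positive, so every transition increases the length;
    # visiting lengths in increasing order finalizes each state before use.
    for cur_len in range(src_len, max_len + 1):
        for v in sorted(by_len.get(cur_len, ())):
            bottleneck = best[(v, cur_len)][0]
            for w in edges.get(v, ()):
                w_weight, w_len = nodes[w]
                new_len = cur_len + w_len
                if new_len > max_len:
                    continue
                cand = min(bottleneck, w_weight)
                if (w, new_len) not in best or cand > best[(w, new_len)][0]:
                    best[(w, new_len)] = (cand, (v, cur_len))
                    by_len[new_len].add(w)
    # Among end states within the tolerance, prefer the highest bottleneck,
    # then the length closest to the estimate.
    candidates = [
        (best[(target, l)][0], -abs(l - est_len), l)
        for l in range(max(est_len - tol, 0), max_len + 1)
        if (target, l) in best
    ]
    if not candidates:
        return None
    bottleneck, _, total = max(candidates)
    path, state = [], (target, total)
    while state is not None:
        path.append(state[0])
        state = best[state][1]
    return path[::-1], bottleneck, total

# Toy graph: A and Z play the role of anchor nodes; C is a low-weight branch.
toy_nodes = {"A": (100, 5), "B": (90, 10), "C": (5, 10), "E": (80, 3), "Z": (100, 5)}
toy_edges = {"A": ["B", "C"], "B": ["Z", "E"], "C": ["Z"], "E": ["Z"]}
result = best_anchored_path(toy_nodes, toy_edges, "A", "Z", est_len=20)
```

In this toy graph, two paths of length 20 reach Z, and the search keeps the one through B because its lowest node weight is higher than that of the path through C. Because states are indexed by accumulated length, the search remains well defined even if the graph contains cycles, at the cost of time and memory proportional to the length bound.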

Grants

  1. R01 HG011065/NHGRI NIH HHS
