SUP: a probabilistic framework to propagate genome sequence uncertainty, with applications.

Devan Becker, David Champredon, Connor Chato, Gopi Gugan, Art Poon
Author Information
  1. Devan Becker: Department of Pathology and Laboratory Medicine, Schulich School of Medicine and Dentistry, Western University, London, Ontario, Canada.
  2. David Champredon: Public Health Agency of Canada, National Microbiology Laboratory, Public Health Risk Sciences Division, Guelph, Ontario, Canada.
  3. Connor Chato: Department of Pathology and Laboratory Medicine, Schulich School of Medicine and Dentistry, Western University, London, Ontario, Canada.
  4. Gopi Gugan: Department of Pathology and Laboratory Medicine, Schulich School of Medicine and Dentistry, Western University, London, Ontario, Canada.
  5. Art Poon: Department of Pathology and Laboratory Medicine, Schulich School of Medicine and Dentistry, Western University, London, Ontario, Canada.

Abstract

Genetic sequencing is subject to many different types of errors, but most analyses treat the resulting sequences as if they were known without error. Next-generation sequencing methods rely on significantly larger numbers of reads than previous sequencing methods, in exchange for lower accuracy in each individual read. Even so, the coverage of such machines is imperfect and leaves uncertainty in many of the base calls. In this work, we demonstrate that uncertainty in sequencing techniques affects downstream analysis, and we propose a straightforward method to propagate that uncertainty. Our method, which we have dubbed Sequence Uncertainty Propagation (SUP), uses a probabilistic matrix representation of individual sequences that incorporates base quality scores as a measure of uncertainty; this representation leads naturally to resampling and replication as a framework for uncertainty propagation. With the matrix representation, resampling possible base calls according to quality scores provides a bootstrap-like or prior-distribution-like first step towards genetic analysis. Analyses based on these resampled sequences include a more complete evaluation of the error involved. We demonstrate our resampling method on SARS-CoV-2 data. The resampling procedures add a linear computational cost to the analyses, but their large impact on the variance of downstream estimates makes it clear that ignoring this uncertainty may lead to overly confident conclusions. We show that SARS-CoV-2 lineage designations via Pangolin are much less certain than the bootstrap support reported by Pangolin would imply, and that clock rate estimates for SARS-CoV-2 are much more variable than reported.
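
The resampling idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes the standard Phred definition of a quality score (error probability 10^(-Q/10)) and, as a simplifying assumption not stated in the abstract, splits the error probability evenly over the three bases that were not called. The function names and the toy sequence are hypothetical.

```python
import random

BASES = "ACGT"

def phred_to_error(q):
    """Standard Phred definition: error probability = 10^(-Q/10)."""
    return 10 ** (-q / 10)

def probability_matrix(seq, quals):
    """Return one row per position and one column per base in BASES.

    Assumption (not from the abstract): the called base receives 1 - p
    and the error mass p is split evenly over the other three bases.
    """
    matrix = []
    for base, q in zip(seq, quals):
        p = phred_to_error(q)
        matrix.append([1 - p if b == base else p / 3 for b in BASES])
    return matrix

def resample(matrix, rng):
    """Draw one plausible sequence, sampling each position independently."""
    return "".join(rng.choices(BASES, weights=row)[0] for row in matrix)

# Toy input: position 2 (Q=3) and position 4 (Q=10) are low-quality calls.
rng = random.Random(0)
seq = "ACGTAC"
quals = [40, 40, 3, 40, 10, 40]

matrix = probability_matrix(seq, quals)
replicates = [resample(matrix, rng) for _ in range(1000)]
```

Under this sketch, each replicate would be passed through the same downstream analysis (for example, lineage assignment or clock rate estimation), and the spread of the results across replicates would serve as the propagated uncertainty. Because every replicate repeats the analysis once, the added cost grows linearly with the number of replicates, in line with the linear cost the abstract mentions.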
