The Statistics of k-mers from a Sequence Undergoing a Simple Mutation Process Without Spurious Matches.

Antonio Blanca, Robert S Harris, David Koslicki, Paul Medvedev
Author Information
  1. Antonio Blanca: Department of Computer Science and Engineering, The Pennsylvania State University, University Park, Pennsylvania, USA.
  2. Robert S Harris: Department of Biology, The Pennsylvania State University, University Park, Pennsylvania, USA.
  3. David Koslicki: Department of Computer Science and Engineering, The Pennsylvania State University, University Park, Pennsylvania, USA.
  4. Paul Medvedev: Department of Computer Science and Engineering, The Pennsylvania State University, University Park, Pennsylvania, USA.

Abstract

k-mer-based methods are widely used in bioinformatics, but there are many gaps in our understanding of their statistical properties. Here, we consider a model in which a sequence S (e.g., a genome or a read) undergoes a simple mutation process: each nucleotide is mutated independently with some probability r, under the assumption that there are no spurious k-mer matches. How does this process affect the k-mers of S? We derive the expectation and variance of the number of mutated k-mers, of the number of islands (maximal intervals of mutated k-mers), and of the number of oceans (maximal intervals of nonmutated k-mers). We then derive hypothesis tests and confidence intervals (CIs) for r given an observed number of mutated k-mers or, alternatively, given the Jaccard similarity (with or without MinHash). We demonstrate the usefulness of our results in three applications: obtaining a CI to supplement the Mash distance point estimate, filtering out reads during alignment by Minimap2, and rating long-read alignments to a de Bruijn graph by Jabba.
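
The mutation model in the abstract can be illustrated with a short simulation. The following Python sketch is not the authors' code; the function names and the parameters `L` (number of k-mers), `k`, and `r` are chosen here for illustration. It assumes a sequence with L k-mers (so L + k - 1 nucleotides), marks each nucleotide as mutated independently with probability r, and treats a k-mer as mutated when any of its k positions is mutated, which is what the no-spurious-match assumption amounts to. The closed form L(1 - (1 - r)^k) used for comparison follows from linearity of expectation; the variance results and CIs from the paper are not reproduced here.

```python
import random


def simulate_mutated_kmers(L, k, r, rng):
    """Return L booleans; entry i is True if the k-mer starting at i is mutated.

    Assumes no spurious matches, so a k-mer counts as mutated exactly when
    at least one of its k nucleotides was mutated.
    """
    n = L + k - 1  # number of nucleotides in a sequence with L k-mers
    mutated_pos = [rng.random() < r for _ in range(n)]
    window = sum(mutated_pos[:k])  # mutated nucleotides in the first k-mer
    flags = [window > 0]
    for i in range(1, L):
        # Slide the window one position to the right.
        window += mutated_pos[i + k - 1] - mutated_pos[i - 1]
        flags.append(window > 0)
    return flags


def count_runs(flags, value):
    """Count maximal runs of `value`: islands for True, oceans for False."""
    runs = 0
    prev = None
    for f in flags:
        if f == value and prev != value:
            runs += 1
        prev = f
    return runs


def expected_mutated(L, k, r):
    """Expected number of mutated k-mers, by linearity of expectation."""
    return L * (1 - (1 - r) ** k)


if __name__ == "__main__":
    rng = random.Random(0)
    L, k, r = 1000, 21, 0.01  # illustrative values only
    flags = simulate_mutated_kmers(L, k, r, rng)
    print("mutated k-mers:", sum(flags))
    print("expected:", expected_mutated(L, k, r))
    print("islands:", count_runs(flags, True))
    print("oceans:", count_runs(flags, False))
```

Because islands and oceans alternate along the sequence, their counts in any single run differ by at most one.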

MeSH Term

Algorithms
Base Sequence
Computational Biology
Confidence Intervals
Genomics
Humans
Models, Genetic
Mutation
Sequence Alignment
Sequence Analysis, DNA
Software
