The Statistics of k-mers from a Sequence Undergoing a Simple Mutation Process Without Spurious Matches.

Antonio Blanca, Robert S Harris, David Koslicki, Paul Medvedev

Author Information

Antonio Blanca: Department of Computer Science and Engineering, The Pennsylvania State University, University Park, Pennsylvania, USA.
Robert S Harris: Department of Biology, The Pennsylvania State University, University Park, Pennsylvania, USA.
David Koslicki: Department of Computer Science and Engineering, The Pennsylvania State University, University Park, Pennsylvania, USA.
Paul Medvedev: Department of Computer Science and Engineering, The Pennsylvania State University, University Park, Pennsylvania, USA.

PMID: 35108101 DOI: 10.1089/cmb.2021.0431

k-mer-based methods are widely used in bioinformatics, but there are many gaps in our understanding of their statistical properties. Here, we consider the simple model where a sequence S (e.g., a genome or a read) undergoes a simple mutation process through which each nucleotide is mutated independently with some probability r, under the assumption that there are no spurious k-mer matches. How does this process affect the k-mers of S? We derive the expectation and variance of the number of mutated k-mers and of the number of islands (a maximal interval of mutated k-mers) and oceans (a maximal interval of nonmutated k-mers). We then derive hypothesis tests and confidence intervals (CIs) for r given an observed number of mutated k-mers, or, alternatively, given the Jaccard similarity (with or without MinHash). We demonstrate the usefulness of our results using a few select applications: obtaining a CI to supplement the Mash distance point estimate, filtering out reads during alignment by Minimap2, and rating long-read alignments to a de Bruijn graph by Jabba.
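
The mutation model described in the abstract can be illustrated with a small simulation. The sketch below is not the authors' code; all function names and parameters (`simulate_mutated_kmers`, `count_runs`, `expected_mutated`, `point_estimate_r`, `num_kmers`) are hypothetical and chosen for illustration. It assumes that, with no spurious matches, a k-mer counts as mutated exactly when at least one of its k nucleotides is mutated, so each k-mer is mutated with probability 1 - (1 - r)^k. It covers only the expectation and a naive point estimate of the mutation probability; the variance, hypothesis tests, and CIs derived in the paper are not reproduced here.

```python
import random


def simulate_mutated_kmers(num_kmers, k, r, rng):
    """Simulate the mutation process; return one boolean per k-mer.

    A sequence with `num_kmers` k-mers has num_kmers + k - 1 nucleotides.
    Each nucleotide is mutated independently with probability r. Under
    the no-spurious-matches assumption, a k-mer is mutated if and only
    if at least one of its k nucleotides is mutated.
    """
    num_nucleotides = num_kmers + k - 1
    mutated_nt = [rng.random() < r for _ in range(num_nucleotides)]
    window = sum(mutated_nt[:k])  # mutated nucleotides in the first k-mer
    flags = [window > 0]
    for i in range(1, num_kmers):
        # Slide the window one position to the right.
        window += mutated_nt[i + k - 1] - mutated_nt[i - 1]
        flags.append(window > 0)
    return flags


def count_runs(flags, value):
    """Count maximal runs of `value` in `flags`.

    With value=True this counts islands (maximal intervals of mutated
    k-mers); with value=False it counts oceans (maximal intervals of
    nonmutated k-mers).
    """
    runs = 0
    previous = None
    for flag in flags:
        if flag == value and previous != value:
            runs += 1
        previous = flag
    return runs


def expected_mutated(num_kmers, k, r):
    """Expected number of mutated k-mers: num_kmers * (1 - (1 - r)^k)."""
    return num_kmers * (1.0 - (1.0 - r) ** k)


def point_estimate_r(n_mut, num_kmers, k):
    """Invert the expectation to get a naive point estimate of r."""
    return 1.0 - (1.0 - n_mut / num_kmers) ** (1.0 / k)


if __name__ == "__main__":
    rng = random.Random(0)
    num_kmers, k, r = 100_000, 21, 0.01
    flags = simulate_mutated_kmers(num_kmers, k, r, rng)
    n_mut = sum(flags)
    print("observed mutated k-mers:", n_mut)
    print("expected mutated k-mers:", expected_mutated(num_kmers, k, r))
    print("islands:", count_runs(flags, True))
    print("oceans:", count_runs(flags, False))
    print("point estimate of r:", point_estimate_r(n_mut, num_kmers, k))
```

Because neighboring k-mers share nucleotides, their mutation states are strongly dependent. This is why the count of mutated k-mers is not binomial, and why a dedicated variance derivation is needed before a point estimate like the one above can be given a confidence interval.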

Jaccard similarity; MinHash; confidence intervals; k-mers; mutation process; sketching

Algorithms

Base Sequence

Computational Biology

Confidence Intervals

Genomics

Humans

Models, Genetic

Mutation

Sequence Alignment

Sequence Analysis, DNA

Software

Evaluation Study; Journal Article; Research Support, U.S. Gov't, Non-P.H.S.
