Introduction

MOTIVATION: Building the histogram of occurrences of every k-symbol-long substring of nucleotide data is a standard step in many bioinformatics applications, known as k-mer counting. Its applications include developing de Bruijn graph genome assemblers, fast multiple sequence alignment and repeat detection. The tremendous amounts of NGS data require fast algorithms for k-mer counting, preferably using moderate amounts of memory.

RESULTS: We present a novel method for k-mer counting that, on large datasets, is about twice as fast as the strongest competitors (Jellyfish 2, KMC 1) while using about 12 GB (or less) of RAM. Our disk-based method bears some resemblance to MSPKmerCounter, yet replacing the original minimizers with signatures (a carefully selected subset of all minimizers) and using (k, x)-mers significantly reduces I/O, and a highly parallel overall architecture achieves unprecedented processing speeds. For example, KMC 2 counts the 28-mers of a human reads collection with 44-fold coverage (106 GB of compressed size) in about 20 min on a 6-core Intel i7 PC with a solid-state disk.
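To make the terms above concrete, here is a minimal in-memory sketch of k-mer counting and of minimizer selection. It is an illustration only: KMC 2 itself is a disk-based, parallel C++ program built on signatures and (k, x)-mers, none of which is reproduced here. The function names `count_kmers` and `minimizer`, the minimizer length `m`, and the toy reads are assumptions made for this example.

```python
from collections import Counter


def count_kmers(reads, k):
    """Return a histogram mapping each k-symbol substring to its count."""
    counts = Counter()
    for read in reads:
        # Slide a window of length k over the read.
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def minimizer(kmer, m):
    """Return the lexicographically smallest m-symbol substring of kmer."""
    return min(kmer[i:i + m] for i in range(len(kmer) - m + 1))


# Toy input (hypothetical reads, not real sequencing data).
reads = ["ACGTACGT", "CGTACG"]
counts = count_kmers(reads, 4)

for kmer, n in sorted(counts.items()):
    # Neighbouring k-mers often share a minimizer, which is why
    # minimizer-based methods can use it to group k-mers into bins.
    print(kmer, n, minimizer(kmer, 2))
```

A plain in-memory counter like this one needs memory proportional to the number of distinct k-mers, which is what disk-based approaches aim to avoid on large datasets.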

Publications

  1. KMC 2: fast and resource-frugal k-mer counting.
    Deorowicz S, Kokot M, Grabowski S, Debudaj-Grabysz A, 2015-05-01 - Bioinformatics (Oxford, England)
  2. Disk-based k-mer counting on a PC.
    Deorowicz S, Debudaj-Grabysz A, Grabowski S, 2013-01-01 - BMC bioinformatics

Credits

  1. Sebastian Deorowicz
    Developer

    Institute of Informatics, Silesian University of Technology, Poland

  2. Marek Kokot
    Developer

    Institute of Informatics, Silesian University of Technology, Poland

  3. Szymon Grabowski
    Developer

    Institute of Informatics, Silesian University of Technology, Poland

  4. Agnieszka Debudaj-Grabysz
    Investigator

    Institute of Informatics, Silesian University of Technology, Poland

Summary

  Accession: BT002420
  Tool Type: Application
  Category:
  Platforms: Linux/Unix
  Technologies: C++
  User Interface: Terminal Command Line
  Download Count: 0
  Country/Region: Poland
  Submitted By: Agnieszka Debudaj-Grabysz