ntCard

Introduction

Many bioinformatics algorithms are designed for the analysis of sequences of some uniform length, conventionally referred to as k-mers. These include de Bruijn graph assembly methods and sequence alignment tools. An efficient algorithm to count the number of unique k-mers, or better still, to build a histogram of k-mer frequencies, would be desirable for these tools and their downstream analysis pipelines. Among other applications, estimated frequencies can be used to predict genome sizes, measure sequencing error rates, and tune runtime parameters for analysis tools. However, calculating a k-mer histogram from large volumes of sequencing data is a challenging task.

Here, we present ntCard, a streaming algorithm for estimating the frequencies of k-mers in genomics datasets. At its core, ntCard uses the ntHash algorithm to efficiently compute hash values for streamed sequences. It then samples the calculated hash values to build a reduced-representation multiplicity table describing the sample distribution. Finally, it uses a statistical model to reconstruct the population distribution from the sample distribution.

We have compared the performance of ntCard with that of other cardinality estimation algorithms. We used three datasets of 480 GB, 500 GB and 2.4 TB in size; the first two represent whole genome shotgun sequencing experiments on the human genome, and the last one on the white spruce genome. Results show that ntCard estimates k-mer coverage frequencies >15× faster than the state-of-the-art algorithms, using a similar amount of memory and with higher accuracy. Thus, our benchmarks demonstrate ntCard as a potentially enabling technology for large-scale genomics applications.

ntCard is written in C++ and is released under the GPL license. It is freely available at https://github.com/bcgsc/ntCard

Contact: hmohamadi@bcgsc.ca or ibirol@bcgsc.ca

Supplementary information: Supplementary data are available at Bioinformatics online.
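The hash-and-sample step described above can be illustrated with a small sketch. This is not ntCard's implementation. It assumes a generic 64-bit hash (BLAKE2b from Python's standard library) as a stand-in for ntHash, ignores reverse complements, and replaces the statistical model with a naive scale-up of the sample. The function names and the sample_bits parameter are hypothetical and chosen only for illustration.

```python
import hashlib
from collections import Counter


def kmers(seq, k):
    """Yield every k-mer of seq that contains only A, C, G and T."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            yield kmer


def hash64(kmer):
    """64-bit stand-in hash (BLAKE2b); ntCard itself uses ntHash."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def sampled_histogram(reads, k, sample_bits):
    """Build a multiplicity histogram from a hash-based sample of k-mers.

    Only k-mers whose hash has its top `sample_bits` bits equal to zero are
    counted, i.e. roughly 1 in 2**sample_bits of the distinct k-mers.
    Returns {multiplicity: number of distinct sampled k-mers}.
    """
    counts = Counter()
    for read in reads:
        for kmer in kmers(read, k):
            h = hash64(kmer)
            if h >> (64 - sample_bits) == 0:
                counts[h] += 1
    return Counter(counts.values())


def naive_distinct_estimate(hist, sample_bits):
    """Scale the sampled distinct count up by the sampling rate.

    Assumption: a simple scale-up, not the statistical model of the paper.
    """
    return sum(hist.values()) * 2 ** sample_bits


if __name__ == "__main__":
    hist = sampled_histogram(["ACGTACGT"], k=4, sample_bits=0)
    print(dict(hist), naive_distinct_estimate(hist, 0))
```

With sample_bits set to 0 every k-mer is kept, so the histogram is exact for small inputs. Larger values reduce memory at the cost of sampling error, which is the gap the statistical model in the paper is meant to close.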

Publications

ntCard: a streaming algorithm for cardinality estimation in genomics data.
Mohamadi H, Khan H, Birol I, 2017-01-01 - Bioinformatics (Oxford, England)

Credits

Hamid Mohamadi
Developer
Faculty of Science, University of British Columbia, Canada
Hamza Khan
Developer
Faculty of Science, University of British Columbia, Canada
Inanc Birol
Investigator
Faculty of Science, University of British Columbia, Canada


Summary

Accession	BT001656
Tool Type	Application
Category
Platforms	Linux/Unix
Technologies	C++
User Interface	Terminal Command Line
Download Count	0
Country/Region	Canada
Submitted By	Inanc Birol
