Introduction

MOTIVATION: Next-generation sequencing technologies are revolutionizing medicine. Data from sequencing technologies are typically represented as a string of bases, an associated sequence of per-base quality scores and other metadata, and in aggregate can require a large amount of space. The quality scores show how accurate the bases are with respect to the sequencing process, that is, how confident the sequencer is of having called them correctly, and are the largest component in datasets in which they are retained. Previous research has examined how to store sequences of bases effectively; here we add to that knowledge by examining methods for compressing quality scores. The quality values originate in a continuous domain, and so if a fidelity criterion is introduced, it is possible to introduce flexibility in the way these values are represented, allowing lossy compression over the quality score data.

RESULTS: We present existing compression options for quality score data, and then introduce two new lossy techniques. Experiments measuring the trade-off between compression ratio and information loss are reported, including quantifying the effect of lossy representations on a downstream application that carries out single nucleotide polymorphism and insert/deletion detection. The new methods are demonstrably superior to other techniques when assessed against the spectrum of possible trade-offs between storage required and fidelity of representation.

AVAILABILITY AND IMPLEMENTATION: An implementation of the methods described here is available at https://github.com/rcanovas/libCSAM.

CONTACT: rcanovas@student.unimelb.edu.au

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Publications

  1. Lossy compression of quality scores in genomic data.
    Cánovas R, Moffat A, Turpin A, 2014-08-01 - Bioinformatics (Oxford, England)
  2. CSAM: Compressed SAM format.
    Cánovas R, Moffat A, Turpin A, 2016-12-01 - Bioinformatics (Oxford, England)

Credits

  1. Rodrigo Cánovas
    Developer

    NICTA Victoria Research Laboratory, Department of Computing and Information Systems, Australia

  2. Alistair Moffat
    Developer

    NICTA Victoria Research Laboratory, Department of Computing and Information Systems, Australia

  3. Andrew Turpin
    Investigator

    NICTA Victoria Research Laboratory, Department of Computing and Information Systems, Australia

Summary

Accession: BT006126
Tool Type: Application
Platforms: Linux/Unix
Technologies: C++
User Interface: Terminal Command Line
Download Count: 0
Country/Region: Australia
Submitted By: Andrew Turpin