Introduction

The increasing throughput of sequencing technologies offers new applications and challenges for computational biology. In many of those applications, sequencing errors need to be corrected. This is particularly important when sequencing reads from an unknown reference such as random DNA barcodes. In this case, error correction can be done by performing a pairwise comparison of all the barcodes, which is a computationally complex problem. Here, we address this challenge and describe an exact algorithm to determine which pairs of sequences lie within a given Levenshtein distance. For error correction or redundancy reduction purposes, matched pairs are then merged into clusters of similar sequences. The efficiency of starcode is attributable to the poucet search, a novel implementation of the Needleman-Wunsch algorithm performed on the nodes of a trie. On the task of matching random barcodes, starcode outperforms sequence clustering algorithms in both speed and precision. The C source code is available at http://github.com/gui11aume/starcode.
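To make the all-pairs problem concrete, the following is a minimal sketch of the naive baseline that the abstract describes as computationally complex: a Needleman-Wunsch style dynamic programming computation of the Levenshtein distance, applied to every pair of sequences. It is not the poucet search and is not taken from the starcode source. The names bounded_levenshtein, count_close_pairs and MAX_LEN are hypothetical, and the 255-character length limit is an assumption made only to keep the example short.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_LEN 255 /* assumed upper bound on sequence length (illustrative) */

/* Levenshtein distance between a and b, computed row by row with the
 * Needleman-Wunsch style recurrence. Returns max_dist + 1 whenever the
 * true distance exceeds max_dist. Hypothetical helper, not starcode code. */
int bounded_levenshtein(const char *a, const char *b, int max_dist)
{
    assert(max_dist >= 0);
    size_t la = strlen(a), lb = strlen(b);
    assert(la <= MAX_LEN && lb <= MAX_LEN);

    /* The length difference is a lower bound on the distance. */
    size_t diff = la > lb ? la - lb : lb - la;
    if (diff > (size_t)max_dist) return max_dist + 1;

    int prev[MAX_LEN + 1], curr[MAX_LEN + 1];
    for (size_t j = 0; j <= lb; j++) prev[j] = (int)j;

    for (size_t i = 1; i <= la; i++) {
        curr[0] = (int)i;
        int row_min = curr[0];
        for (size_t j = 1; j <= lb; j++) {
            int sub = prev[j - 1] + (a[i - 1] != b[j - 1]); /* match or mismatch */
            int del = prev[j] + 1;                          /* skip a[i-1] */
            int ins = curr[j - 1] + 1;                      /* skip b[j-1] */
            int best = sub < del ? sub : del;
            if (ins < best) best = ins;
            curr[j] = best;
            if (best < row_min) row_min = best;
        }
        /* Row minima never decrease, so the final distance cannot recover. */
        if (row_min > max_dist) return max_dist + 1;
        memcpy(prev, curr, (lb + 1) * sizeof(int));
    }
    return prev[lb] > max_dist ? max_dist + 1 : prev[lb];
}

/* Counts the pairs (i, j), i < j, whose distance is at most max_dist.
 * This brute-force loop performs n * (n - 1) / 2 comparisons. */
size_t count_close_pairs(const char *const *seqs, size_t n, int max_dist)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        for (size_t j = i + 1; j < n; j++)
            if (bounded_levenshtein(seqs[i], seqs[j], max_dist) <= max_dist)
                count++;
    return count;
}
```

The number of comparisons grows quadratically with the number of sequences, which is the cost that motivates running the alignment on the nodes of a trie instead, as the abstract describes.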

Publications

  1. Starcode: sequence clustering based on all-pairs search.
    Zorita E, Cuscó P, Filion GJ, 2015-06-01 - Bioinformatics (Oxford, England)

Credits

  1. Eduard Zorita
    Developer

    Genome Architecture, Gene Regulation, Spain

  2. Pol Cuscó
    Developer

    Genome Architecture, Gene Regulation, Spain

  3. Guillaume J Filion
    Investigator

    Genome Architecture, Gene Regulation, Spain

Summary

  Accession: BT005905
  Tool Type: Application
  Category:
  Platforms: Linux/Unix
  Technologies: C
  User Interface: Terminal Command Line
  Download Count: 0
  Country/Region: Spain
  Submitted By: Guillaume J Filion