CARE: context-aware sequencing read error correction.

Felix Kallenborn, Andreas Hildebrandt, Bertil Schmidt
Author Information
  1. Felix Kallenborn: Department of Computer Science, Johannes Gutenberg University, Mainz 55122, Germany.
  2. Andreas Hildebrandt: Department of Computer Science, Johannes Gutenberg University, Mainz 55122, Germany.
  3. Bertil Schmidt: Department of Computer Science, Johannes Gutenberg University, Mainz 55122, Germany.

Abstract

MOTIVATION: Error correction is a fundamental pre-processing step in many Next-Generation Sequencing (NGS) pipelines, in particular for de novo genome assembly. However, existing error correction methods either suffer from high false-positive rates since they break reads into independent k-mers or do not scale efficiently to large amounts of sequencing reads and complex genomes.
RESULTS: We present CARE, an alignment-based, scalable error correction algorithm for Illumina data that uses the concept of minhashing. Minhashing allows for efficient similarity search within large sequencing read collections, which enables fast computation of high-quality multiple alignments. Sequencing errors are corrected by detailed inspection of the corresponding alignments. Our performance evaluation shows that CARE generates significantly fewer false-positive corrections than state-of-the-art tools (Musket, SGA, BFC, Lighter, Bcool, Karect) while maintaining a competitive number of true positives. When used prior to assembly, it can achieve superior de novo assembly results for a number of real datasets. CARE is also the first multiple sequence alignment-based error corrector that is able to process a human genome Illumina NGS dataset in only 4 h on a single workstation using GPU acceleration.
AVAILABILITY AND IMPLEMENTATION: CARE is open-source software written in C++ (CPU version) and in CUDA/C++ (GPU version). It is licensed under GPLv3 and can be downloaded at https://github.com/fkallen/CARE.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
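
ILLUSTRATION (not part of the original abstract): The minhashing idea mentioned under RESULTS can be sketched in a few lines of Python. This is a minimal sketch of generic minhash-based candidate lookup, not CARE's actual C++/CUDA implementation. The function names, the k-mer length k=8, the number of hash functions (16), and the use of salted BLAKE2b as the hash family are all assumptions made for illustration only.

```python
import hashlib


def kmers(read, k):
    """Return the set of all length-k substrings of a read."""
    return {read[i:i + k] for i in range(len(read) - k + 1)}


def hash_kmer(kmer, seed):
    """64-bit hash of a k-mer; the seed selects one member of a hash family.

    Salted BLAKE2b is an assumed stand-in for whatever hash a real tool uses.
    """
    digest = hashlib.blake2b(
        kmer.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(read, k=8, num_hashes=16):
    """For each hash function, keep the smallest hash value over all k-mers."""
    ks = kmers(read, k)
    return tuple(min(hash_kmer(km, seed) for km in ks) for seed in range(num_hashes))


def build_index(reads, k=8, num_hashes=16):
    """Build one hash table per hash function: minhash value -> list of read ids."""
    index = [dict() for _ in range(num_hashes)]
    for read_id, read in enumerate(reads):
        if len(read) < k:
            continue  # too short to contain a k-mer
        for table, value in zip(index, minhash_signature(read, k, num_hashes)):
            table.setdefault(value, []).append(read_id)
    return index


def candidates(query, index, k=8):
    """Return ids of reads that share at least one minhash value with the query."""
    signature = minhash_signature(query, k, len(index))
    found = set()
    for table, value in zip(index, signature):
        found.update(table.get(value, ()))
    return found


reads = [
    "ACGTACGTTAGCCGATAGGCTA",
    "CGTACGTTAGCCGATAGGCTAG",
    "TTTTTTTTTTTTTTTTTTTTTT",
]
index = build_index(reads)
similar = candidates(reads[0], index)
```

Reads that share many k-mers are likely to agree on at least one minhash value, so the lookup returns a small candidate set without comparing the query against every read. In an alignment-based corrector, only those candidates would then be aligned to the query.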

MeSH Terms

Algorithms
High-Throughput Nucleotide Sequencing
Humans
Sequence Alignment
Sequence Analysis, DNA
Software
