PgRC2: engineering the compression of sequencing reads.

Tomasz M Kowalski, Szymon Grabowski

Author Information

Tomasz M Kowalski: Institute of Applied Computer Science, Lodz University of Technology, Lodz 90-924, Poland.
Szymon Grabowski: Institute of Applied Computer Science, Lodz University of Technology, Lodz 90-924, Poland.

PMID: 40037801 DOI: 10.1093/bioinformatics/btaf101

SUMMARY: The FASTQ format remains at the heart of high-throughput sequencing. Specialized FASTQ compressors have advanced, yet their practical performance tradeoffs remain imperfect. We present a multi-threaded version of Pseudogenome-based Read Compressor (PgRC), an in-memory algorithm for compressing the DNA stream, based on the idea of approximating the shortest common superstring over high-quality reads. Redundancy in the obtained string is efficiently removed by using a compact temporary representation. The current version, v2.0, preserves the compression ratio of the previous one while reducing compression time by a factor of 8-9 and decompression time by a factor of 2-2.5 on a 14-core/28-thread machine.
AVAILABILITY AND IMPLEMENTATION: PgRC 2.0 can be downloaded from https://github.com/kowallus/PgRC and https://zenodo.org/records/14882486 (10.5281/zenodo.14882486).
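
ILLUSTRATION: The idea of approximating the shortest common superstring over reads, mentioned in the summary, can be sketched with a textbook greedy merge. This is a toy under stated assumptions, not the PgRC implementation: the function names (overlap, greedy_superstring, encode) are hypothetical, the search is quadratic in the number of reads, and reverse complements, mismatches, and quality filtering are all ignored. It only shows why overlapping reads can be stored as positions in one shared string (a "pseudogenome") instead of as separate sequences.

```python
from itertools import permutations


def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(reads: list[str]) -> str:
    """Greedy approximation of the shortest common superstring (toy version)."""
    # Drop exact duplicates, then reads fully contained in another read.
    pool = list(dict.fromkeys(reads))
    pool = [r for r in pool if not any(r != s and r in s for s in pool)]
    while len(pool) > 1:
        # Pick the ordered pair (i, j) whose suffix/prefix overlap is largest.
        k, i, j = max(
            (overlap(a, b), i, j)
            for (i, a), (j, b) in permutations(enumerate(pool), 2)
        )
        # Merge the pair, writing the shared k characters only once.
        merged = pool[i] + pool[j][k:]
        pool = [r for idx, r in enumerate(pool) if idx not in (i, j)] + [merged]
    return pool[0] if pool else ""


def encode(reads: list[str], pseudogenome: str) -> list[tuple[int, int]]:
    """Represent each read as (offset, length) within the pseudogenome."""
    return [(pseudogenome.find(r), len(r)) for r in reads]


if __name__ == "__main__":
    reads = ["ACGTAC", "GTACGG", "ACGGTT"]
    pg = greedy_superstring(reads)
    print(pg, encode(reads, pg))
```

For the three 6-base reads above (18 bases in total), the greedy merge yields the 10-base string ACGTACGGTT, and each read reduces to an offset into it. A production compressor would need indexed overlap detection rather than this all-pairs scan.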

/Lodz University of Technology
501/12-24-1-5418/Faculty of Electrical, Electronic, Computer and Control Engineering

Algorithms

High-Throughput Nucleotide Sequencing

Sequence Analysis, DNA

Data Compression

Software

Journal Article

