Distributed hybrid-indexing of compressed pan-genomes for scalable and fast sequence alignment.

Altti Ilari Maarala, Ossi Arasalo, Daniel Valenzuela, Veli Mäkinen, Keijo Heljanko

Author Information

Altti Ilari Maarala: Department of Computer Science, University of Helsinki, Espoo, Finland.
Ossi Arasalo: Department of Computer Science, Aalto University, Espoo, Finland.
Daniel Valenzuela: Department of Computer Science, University of Helsinki, Espoo, Finland.
Veli Mäkinen: Department of Computer Science, University of Helsinki, Espoo, Finland.
Keijo Heljanko: Department of Computer Science, University of Helsinki, Espoo, Finland.

PMID: 34343181 DOI: 10.1371/journal.pone.0255260

Computational pan-genomics utilizes information from multiple individual genomes in large-scale comparative analysis. Genetic variation between cases and controls, ethnic groups, or species can be discovered thoroughly using pan-genomes of such subpopulations. Whole-genome sequencing (WGS) data volumes are growing rapidly, making genomic data compression and indexing methods very important. Although current methods for compressing and indexing repetitive sequences are space-efficient, the deployed compression methods are often sequential and computationally time-consuming, and they do not provide efficient sequence alignment performance on vast collections of genomes such as pan-genomes. To perform rapid analytics on ever-growing genomics data, data compression and indexing methods have to exploit distributed and parallel computing more efficiently. Instead of strict genome data compression methods, we focus on the efficient construction of a compressed index for pan-genomes. A compressed hybrid index enables fast sequence alignment to several genomes at once while shrinking the index size significantly compared to traditional indexes. We propose a scalable distributed compressed hybrid-indexing method for large genomic data sets, enabling pan-genome-based sequence search and read alignment. We show the scalability of our tool, DHPGIndex, by executing experiments in a distributed Apache Spark-based computing cluster comprising 448 cores distributed over 26 nodes. The experiments were performed with both human and bacterial genomes. DHPGIndex built a BLAST index for a human pan-genome (n = 250) with an 870:1 compression ratio (CR) in 342 minutes and a Bowtie2 index with a 157:1 CR in 397 minutes. For a human pan-genome with n = 1,000, the BLAST index was built in 1520 minutes with a 532:1 CR and the Bowtie2 index in 1938 minutes with a 76:1 CR. Bowtie2 aligned 14.6 GB of paired-end reads to the compressed (n = 1,000) index in 31.7 minutes on a single node. Compressing the GenBank database (n = 13,375,031; 488 GB) into a BLAST index took 575 minutes and resulted in a CR of 62:1. BLASTing 189,864 CRISPR-Cas9 gRNA target sequences (23 MB in total) against the compressed index of the human pan-genome (n = 1,000) finished in 45 minutes on a single node. 30 MB of mixed bacterial sequences (n = 599) were BLASTed against the compressed index of the 488 GB GenBank database (n = 13,375,031) in 26 minutes on 25 nodes. 78 MB of mixed sequences (n = 4,167) were BLASTed against the compressed index of an 18 GB E. coli sequence database (n = 745,409) in 5.4 minutes on a single node.
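To put the reported figures in perspective, the following is a minimal arithmetic sketch, not part of the published work. It assumes that a CR of r:1 means uncompressed input size divided by index size (the abstract does not define CR explicitly) and that the 488 GB figure is the uncompressed input to which the 62:1 ratio applies. The function names are hypothetical and chosen only for illustration.

```python
def compressed_size_gb(original_gb: float, ratio: float) -> float:
    """Index size implied by a ratio:1 compression ratio.

    Assumption: CR = uncompressed input size / compressed index size.
    """
    if ratio <= 0:
        raise ValueError("compression ratio must be positive")
    return original_gb / ratio


def time_growth(minutes_small: float, minutes_large: float) -> float:
    """Factor by which build time grows between two collection sizes."""
    return minutes_large / minutes_small


# GenBank: 488 GB input, 62:1 CR (figures from the abstract).
genbank_index_gb = compressed_size_gb(488, 62)

# Human pan-genome: build time at n = 250 versus n = 1,000 (a 4x larger collection).
blast_growth = time_growth(342, 1520)
bowtie2_growth = time_growth(397, 1938)

print(f"Implied GenBank BLAST index size: {genbank_index_gb:.2f} GB")
print(f"BLAST index build time growth for 4x genomes: {blast_growth:.2f}x")
print(f"Bowtie2 index build time growth for 4x genomes: {bowtie2_growth:.2f}x")
```

Under these assumptions, the GenBank index would occupy a little under 8 GB, and quadrupling the number of human genomes increases index build time by a factor between four and five for both index types.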

Base Sequence

Data Compression

Escherichia coli

Genome, Bacterial

Genome, Human

High-Throughput Nucleotide Sequencing

Humans

Sequence Alignment

Journal Article; Research Support, Non-U.S. Gov't
