Data structures based on k-mers for querying large collections of sequencing data sets.
Camille Marchet, Christina Boucher, Simon J Puglisi, Paul Medvedev, Mikaël Salson, Rayan Chikhi
Author Information
Camille Marchet: Université de Lille, CNRS, CRIStAL UMR 9189, F-59000 Lille, France.
Christina Boucher: Department of Computer and Information Science and Engineering, University of Florida, Gainesville, Florida 32611, USA.
Simon J Puglisi: Department of Computer Science, University of Helsinki, FI-00014, Helsinki, Finland.
Paul Medvedev: Department of Computer Science, The Pennsylvania State University, University Park, Pennsylvania 16802, USA.
Mikaël Salson: Université de Lille, CNRS, CRIStAL UMR 9189, F-59000 Lille, France.
Rayan Chikhi: Institut Pasteur & CNRS, C3BI USR 3756, F-75015 Paris, France.
High-throughput sequencing data sets are usually deposited in public repositories (e.g., the European Nucleotide Archive) to ensure reproducibility. As the amount of data has reached petabyte scale, repositories do not allow one to perform online sequence searches, even though such a feature would be highly useful to investigators. Toward this goal, several computational approaches have been introduced in the last few years to index and query large collections of data sets. Here, we provide an accessible survey of these approaches, which are generally based on representing data sets as sets of k-mers. We review their properties, introduce a classification, and present their general intuition. We summarize their performance and highlight their current strengths and limitations.
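To make the idea of representing data sets as sets of k-mers concrete, the following is a minimal sketch and not any of the surveyed methods. It stores each data set as a plain Python set of k-mers and reports the data sets that contain at least a given fraction of a query's k-mers. The function names, the example data, and the 0.8 threshold are illustrative assumptions; reverse complements, sequencing errors, and any space-efficient encoding are ignored.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings (k-mers) of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_index(datasets, k):
    """Map each data set name to the union of the k-mers of its sequences.

    `datasets` is a hypothetical dict: name -> list of sequence strings.
    """
    return {
        name: set().union(*(kmers(s, k) for s in seqs))
        for name, seqs in datasets.items()
    }


def query(index, seq, k, threshold=0.8):
    """Return names of data sets holding >= threshold of the query's k-mers.

    The threshold value is an assumption made for this illustration.
    """
    q = kmers(seq, k)
    if not q:  # query shorter than k: nothing to look up
        return []
    return sorted(
        name for name, ks in index.items()
        if len(q & ks) / len(q) >= threshold
    )


# Toy example with two tiny "data sets" (made-up sequences).
datasets = {"A": ["ACGTACGT"], "B": ["TTTTGGGG"]}
index = build_index(datasets, k=4)
hits = query(index, "ACGTAC", k=4)
```

Exact sets like these would not scale to petabyte-size collections; they only illustrate the membership question that an index over k-mer sets has to answer.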
R01 AI141810/NIAID NIH HHS
Algorithms
High-Throughput Nucleotide Sequencing
Reproducibility of Results
Software