A Large-Scale and Serverless Computational Approach for Improving Quality of NGS Data Supporting Big Multi-Omics Data Analyses.

Dariusz Mrozek, Krzysztof Stępień, Piotr Grzesik, Bożena Małysiak-Mrozek

Author Information

Dariusz Mrozek: Department of Applied Informatics, Silesian University of Technology, Gliwice, Poland.
Krzysztof Stępień: Department of Applied Informatics, Silesian University of Technology, Gliwice, Poland.
Piotr Grzesik: Department of Applied Informatics, Silesian University of Technology, Gliwice, Poland.
Bożena Małysiak-Mrozek: Department of Graphics, Computer Vision and Digital Systems, Silesian University of Technology, Gliwice, Poland.

PMID: 34326863 DOI: 10.3389/fgene.2021.699280

Various types of analyses performed over multi-omics data are driven today by next-generation sequencing (NGS) techniques that produce large volumes of DNA/RNA sequences. Although many tools allow for parallel processing of NGS data in a Big Data distributed environment, they do not facilitate improving the quality of NGS data at a large scale in a simple declarative manner. Meanwhile, large sequencing projects and routine DNA/RNA sequencing associated with molecular profiling of diseases for personalized treatment require both good-quality data and appropriate infrastructure for efficient storage and processing of the data. To solve these problems, we adapt the concept of a Data Lake for storing and processing big NGS data. We also propose a dedicated library for cleaning the DNA/RNA sequences obtained with single-read and paired-end sequencing techniques. To accommodate the growth of NGS data, our solution is highly scalable in the Cloud and can rapidly and flexibly adjust to the amount of data to be processed. Moreover, to simplify the use of the data cleaning methods and the implementation of other phases of data analysis workflows, our library extends the declarative U-SQL query language, providing a set of capabilities for data extraction, processing, and storage. The results of our experiments show that the whole solution supports the requirements for ample storage and highly parallel, scalable processing that accompany NGS-based multi-omics data analyses.
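
To make the notion of "cleaning" reads concrete, the following is a minimal Python sketch of one common cleaning step: trimming low-quality bases from the 3' end of each read and discarding reads that become too short. It is not the authors' library and does not use U-SQL; the function names (`parse_fastq`, `trim_tail`, `clean_reads`), the thresholds, and the Phred+33 quality encoding are all assumptions made for illustration.

```python
from typing import Iterator, List, Tuple

# Hypothetical record type: (header, sequence, quality string)
Read = Tuple[str, str, str]


def parse_fastq(lines: List[str]) -> Iterator[Read]:
    """Yield (header, sequence, quality) from 4-line FASTQ records."""
    for i in range(0, len(lines) - 3, 4):
        yield lines[i].strip(), lines[i + 1].strip(), lines[i + 3].strip()


def trim_tail(seq: str, qual: str, min_q: int = 20) -> Tuple[str, str]:
    """Cut bases below min_q from the 3' end (Phred+33 encoding assumed)."""
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - 33 < min_q:
        end -= 1
    return seq[:end], qual[:end]


def clean_reads(lines: List[str], min_q: int = 20, min_len: int = 5) -> List[Read]:
    """Trim every read and keep only those still at least min_len bases long."""
    kept = []
    for header, seq, qual in parse_fastq(lines):
        seq, qual = trim_tail(seq, qual, min_q)
        if len(seq) >= min_len:
            kept.append((header, seq, qual))
    return kept
```

Calling `clean_reads` on the lines of a single-read FASTQ file returns the surviving reads. For paired-end data, a comparable sketch would apply the same trimming to both mates and keep a pair only if both mates pass the length filter.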

Keywords: OMICS data; big data; cloud computing; data lake; data quality; next-generation sequencing; querying; serverless

Journal Article
