VARUS: sampling complementary RNA reads from the sequence read archive.

Mario Stanke, Willy Bruhn, Felix Becker, Katharina J Hoff

Author Information

Mario Stanke: Institute for Mathematics and Computer Science, University of Greifswald, Walther-Rathenau-Str. 47, Greifswald, 17489, Germany. mario.stanke@uni-greifswald.de.
Willy Bruhn: Institute for Mathematics and Computer Science, University of Greifswald, Walther-Rathenau-Str. 47, Greifswald, 17489, Germany.
Felix Becker: Institute for Mathematics and Computer Science, University of Greifswald, Walther-Rathenau-Str. 47, Greifswald, 17489, Germany.
Katharina J Hoff: Institute for Mathematics and Computer Science, University of Greifswald, Walther-Rathenau-Str. 47, Greifswald, 17489, Germany.

PMID: 31703556 DOI: 10.1186/s12859-019-3182-x

BACKGROUND: Vast amounts of next-generation sequencing RNA data have been deposited in archives, accompanying very diverse original studies. The data are also readily available for other purposes, such as genome annotation or transcriptome assembly. However, selecting a subset of the available experiments, sequencing runs and reads for such a purpose is a nontrivial task, complicated by the inhomogeneity of the data.
RESULTS: This article presents the software VARUS, which selects, downloads and aligns reads from NCBI's Sequence Read Archive, given only the species' binomial name and genome. VARUS automatically chooses, from among all archived runs, the runs from which it randomly samples subsets of reads. The objective of its online algorithm is to cover a large number of transcripts adequately when network bandwidth and computing resources are limited. For most tested species, VARUS achieved both higher sensitivity and higher specificity with fewer downloaded reads than manual run selection did. Using twelve eukaryotic genomes as examples, we show that RNA-Seq data sampled with VARUS are well suited for fully automatic genome annotation with BRAKER.
CONCLUSIONS: With VARUS, genome annotation can be automated to the extent that not even the selection and quality control of RNA-Seq data have to be done manually. This makes it possible to run fully automated genome annotation loops over potentially many species without losing accuracy relative to a manually supervised annotation process.
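
The abstract describes an online procedure that repeatedly picks a run, samples a batch of reads from it, and uses what it has seen so far to decide where to sample next. The following is a minimal sketch of that general idea only, not the VARUS algorithm itself: it uses a simple greedy score, simulates runs as fixed read distributions over transcripts instead of downloading from the Sequence Read Archive, and all names (`choose_run`, `sample_reads`, `true_dists`, `target`, `pseudocount`) are hypothetical.

```python
import random
from collections import Counter


def choose_run(profiles, coverage, target, pseudocount=1.0):
    """Greedy choice: pick the run whose estimated read distribution
    puts the most probability mass on transcripts still below target."""
    best_run, best_score = None, -1.0
    for run, counts in profiles.items():
        # Smoothed estimate, so that unseen runs are still considered.
        total = sum(counts.values()) + pseudocount * len(coverage)
        score = sum(
            (counts[t] + pseudocount) / total
            for t, depth in coverage.items()
            if depth < target
        )
        if score > best_score:
            best_run, best_score = run, score
    return best_run


def sample_reads(true_dists, n_transcripts, batches, batch_size, target, seed=0):
    """Simulate batch-wise sampling. true_dists maps a run name to a list of
    per-transcript weights (an assumed stand-in for a real sequencing run)."""
    rng = random.Random(seed)
    transcripts = list(range(n_transcripts))
    coverage = {t: 0 for t in transcripts}             # reads seen per transcript
    profiles = {run: Counter() for run in true_dists}  # reads seen per run
    for _ in range(batches):
        run = choose_run(profiles, coverage, target)
        # Stand-in for "download and align a random batch of reads from this run".
        batch = rng.choices(transcripts, weights=true_dists[run], k=batch_size)
        for t in batch:
            coverage[t] += 1
            profiles[run][t] += 1
    return coverage, profiles


# Two simulated runs that express disjoint halves of ten transcripts.
demo_dists = {"A": [1] * 5 + [0] * 5, "B": [0] * 5 + [1] * 5}
demo_coverage, demo_profiles = sample_reads(
    demo_dists, n_transcripts=10, batches=10, batch_size=50, target=20
)
print({run: sum(c.values()) for run, c in demo_profiles.items()})
```

In this toy setting the sampler switches to the second run once the transcripts expressed in the first run approach the target depth, which illustrates why complementary runs are preferable to drawing all reads from a single run.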

Genome annotation; Online algorithm; RNA-Seq; Sample

R01 GM128145/NIGMS NIH HHS

Algorithms

Animals

Databases, Genetic

Drosophila melanogaster

Eukaryota

High-Throughput Nucleotide Sequencing

Introns

Molecular Sequence Annotation

RNA, Complementary

Sequence Analysis, RNA

Software

Transcriptome

Journal Article

GeneMark-EP+: eukaryotic gene prediction with self-training in the space of genes and proteins.
BRAKER3: Fully automated genome annotation using RNA-seq and protein evidence with GeneMark-ETP, AUGUSTUS, and TSEBRA.
TSEBRA: transcript selector for BRAKER.
Galba: genome annotation with miniprot and AUGUSTUS.
BRAKER2: automatic eukaryotic genome annotation with GeneMark-EP+ and AUGUSTUS supported by a protein database.
GALBA: Genome Annotation with Miniprot and AUGUSTUS.