GAAP: Genome-organization-framework-Assisted Assembly Pipeline for prokaryotic genomes.

Lina Yuan, Yang Yu, Yanmin Zhu, Yulai Li, Changqing Li, Rujiao Li, Qin Ma, Gilman Kit-Hang Siu, Jun Yu, Taijiao Jiang, Jingfa Xiao, Yu Kang
Author Information
  1. Lina Yuan: CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing, 100101, China.
  2. Yang Yu: School of Life Sciences, Liaoning University, Shenyang, 110036, China.
  3. Yanmin Zhu: CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing, 100101, China.
  4. Yulai Li: Beijing SpeedyCloud Technologies Co., Ltd., Beijing, 110036, China.
  5. Changqing Li: Department of Otolaryngology, Beijing Geriatric Hospital, Beijing, 100095, China.
  6. Rujiao Li: CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing, 100101, China.
  7. Qin Ma: Department of Agronomy, Horticulture, and Plant Science, South Dakota State University, Brookings, SD, 57007, USA.
  8. Gilman Kit-Hang Siu: Department of Health Technology and Informatics, The Hong Kong Polytechnic University, Hong Kong, China.
  9. Jun Yu: CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing, 100101, China.
  10. Taijiao Jiang: Center for Systems Medicine, Institute of Basic Medical Sciences, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing, 100005, China.
  11. Jingfa Xiao: CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing, 100101, China. xiaojf@big.ac.cn.
  12. Yu Kang: CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing, 100101, China. kangy@big.ac.cn.

Abstract

BACKGROUND: Next-generation sequencing (NGS) technologies have greatly advanced the genomic study of prokaryotes. However, the highly fragmented assemblies that result from short NGS reads still limit insight into genome biology. Reference-assisted tools are promising for genome assembly, but they tend to produce misassemblies when the assigned reference has extensive rearrangements relative to the target genome.
RESULTS: Herein, we present GAAP, a genome assembly pipeline for scaffolding based on the core-gene-defined Genome Organizational Framework (cGOF) described in our previous study. Instead of assigning a single reference, we use cGOFs derived from multiple references as indexes to order and orient the scaffolds and build a skeleton structure. We then use read pairs to extend scaffolds, a step we call local scaffolding, and to distinguish true from chimeric adjacencies in the scaffolds. In performance tests using both empirical and simulated data from 15 genomes of six species, covering diverse genome sizes and complexity and all three categories of cGOFs, GAAP outperforms or performs comparably to three other reference-assisted programs: AlignGraph, Ragout, and MeDuSa.
CONCLUSIONS: GAAP uses both cGOFs and paired-end reads to create genome-scale assemblies. It performs better than currently available reference-assisted assembly tools, recovering more of the assembly with fewer false placements, especially for species with extensively rearranged genomes. Our method is a promising solution for reconstructing genome sequences from short NGS reads.
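
The ordering and orientation step named in RESULTS can be illustrated with a minimal Python sketch. This is not GAAP's implementation: the function name, the input layout (a cGOF given as a list of core-gene IDs, and each scaffold given as the core genes found along it), and the majority-vote orientation rule are all assumptions made for illustration. The sketch also ignores chromosome circularity and the read-pair-based local scaffolding step.

```python
from statistics import median


def order_and_orient(cgof, scaffolds):
    """Place scaffolds along a core-gene framework (illustrative only).

    cgof: list of core-gene IDs in framework order (hypothetical input format).
    scaffolds: dict mapping scaffold name -> list of core-gene IDs in the
        order they occur along that scaffold.
    Returns a list of (scaffold_name, orientation) in framework order.
    """
    # Index of each core gene within the framework.
    rank = {gene: i for i, gene in enumerate(cgof)}
    placed = []
    for name, genes in scaffolds.items():
        idx = [rank[g] for g in genes if g in rank]
        if not idx:
            # No core genes: the framework alone cannot place this scaffold.
            continue
        # Orientation by majority vote over adjacent core-gene pairs.
        ups = sum(b > a for a, b in zip(idx, idx[1:]))
        downs = sum(b < a for a, b in zip(idx, idx[1:]))
        orientation = "-" if downs > ups else "+"
        # The median framework index serves as the scaffold's position.
        placed.append((median(idx), name, orientation))
    placed.sort()
    return [(name, orientation) for _, name, orientation in placed]
```

Scaffolds without core genes are left unplaced here; in the pipeline described above, read pairs would be the natural evidence for positioning them and for checking the adjacencies the framework suggests.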

MeSH Terms

Algorithms
Computational Biology
Genome
Genomics
Prokaryotic Cells
Reproducibility of Results
