SeqHBase: a big data toolset for family based sequencing data analysis.

Min He, Thomas N Person, Scott J Hebbring, Ethan Heinzen, Zhan Ye, Steven J Schrodi, Elizabeth W McPherson, Simon M Lin, Peggy L Peissig, Murray H Brilliant, Jason O'Rawe, Reid J Robison, Gholson J Lyon, Kai Wang
Author Information
  1. Min He: Center for Human Genetics, Marshfield Clinic Research Foundation, Marshfield, Wisconsin, USA; Biomedical Informatics Research Center, Marshfield Clinic Research Foundation, Marshfield, Wisconsin, USA; Department of Computation and Informatics in Biology and Medicine, University of Wisconsin-Madison, Madison, Wisconsin, USA.
  2. Thomas N Person: Center for Human Genetics, Marshfield Clinic Research Foundation, Marshfield, Wisconsin, USA.
  3. Scott J Hebbring: Center for Human Genetics, Marshfield Clinic Research Foundation, Marshfield, Wisconsin, USA; Department of Computation and Informatics in Biology and Medicine, University of Wisconsin-Madison, Madison, Wisconsin, USA.
  4. Ethan Heinzen: College of Science and Engineering, University of Minnesota-Twin Cities, Minnesota, Minnesota, USA.
  5. Zhan Ye: Biomedical Informatics Research Center, Marshfield Clinic Research Foundation, Marshfield, Wisconsin, USA.
  6. Steven J Schrodi: Center for Human Genetics, Marshfield Clinic Research Foundation, Marshfield, Wisconsin, USA; Department of Computation and Informatics in Biology and Medicine, University of Wisconsin-Madison, Madison, Wisconsin, USA.
  7. Elizabeth W McPherson: Department of Medical Genetics Services, Marshfield Clinic, Marshfield, Wisconsin, USA.
  8. Simon M Lin: The Research Institute at Nationwide Children's Hospital, Columbus, Ohio, USA.
  9. Peggy L Peissig: Biomedical Informatics Research Center, Marshfield Clinic Research Foundation, Marshfield, Wisconsin, USA.
  10. Murray H Brilliant: Center for Human Genetics, Marshfield Clinic Research Foundation, Marshfield, Wisconsin, USA; Department of Computation and Informatics in Biology and Medicine, University of Wisconsin-Madison, Madison, Wisconsin, USA.
  11. Jason O'Rawe: Cold Spring Harbor Laboratory, Stanley Institute for Cognitive Genomics, Cold Spring Harbor, New York, USA.
  12. Reid J Robison: Utah Foundation for Biomedical Research, Provo, Utah, USA.
  13. Gholson J Lyon: Cold Spring Harbor Laboratory, Stanley Institute for Cognitive Genomics, Cold Spring Harbor, New York, USA; Utah Foundation for Biomedical Research, Provo, Utah, USA.
  14. Kai Wang: Utah Foundation for Biomedical Research, Provo, Utah, USA; Zilkha Neurogenetic Institute, University of Southern California, Los Angeles, California, USA; Department of Psychiatry, University of Southern California, Los Angeles, California, USA.

Abstract

BACKGROUND: Whole-genome sequencing (WGS) and whole-exome sequencing (WES) technologies are increasingly used to identify disease-contributing mutations in human genomic studies. Processing such data can be a significant challenge, especially when a large family or cohort is sequenced. Our objective was to develop a big data toolset that efficiently manipulates genome-wide variants, functional annotations and coverage, and supports family-based sequencing data analysis.
METHODS: Hadoop is a framework for reliable, scalable, distributed processing of large data sets using MapReduce programming models. Based on Hadoop and HBase, we developed SeqHBase, a big data-based toolset for analysing family-based sequencing data to detect de novo, inherited homozygous, or compound heterozygous mutations that may contribute to disease manifestations. SeqHBase takes as input BAM files (for coverage at every site), variant call format (VCF) files (for variant calls) and functional annotations (for variant prioritisation).
RESULTS: We applied SeqHBase to a 5-member nuclear family and a 10-member 3-generation family with WGS data, as well as a 4-member nuclear family with WES data. Analysis times scaled almost linearly with the number of data nodes. With 20 data nodes, SeqHBase took about 5 seconds to analyse WES familial data and approximately 1 minute to analyse WGS familial data.
CONCLUSIONS: These results demonstrate SeqHBase's high efficiency and scalability, which are necessary as WGS and WES are rapidly becoming standard methods to study the genetics of familial disorders.
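
ILLUSTRATIVE SKETCH: The three inheritance patterns named in METHODS (de novo, inherited homozygous, compound heterozygous) can be illustrated for a single parent-child trio with a few lines of Python. This is not SeqHBase's implementation, which runs on Hadoop and HBase and is not shown in this record. The names `TrioVariant`, `is_de_novo`, `is_inherited_homozygous` and `compound_het_genes` are hypothetical, genotypes are assumed to be pre-encoded as alternate-allele counts (0, 1 or 2), and coverage checks and annotation-based prioritisation are deliberately left out.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple


class TrioVariant(NamedTuple):
    """One variant site in a trio; genotypes are alternate-allele counts (0, 1, 2)."""
    gene: str
    variant_id: str
    child: int
    father: int
    mother: int


def is_de_novo(v: TrioVariant) -> bool:
    # The child carries an alternate allele that neither parent carries.
    return v.child >= 1 and v.father == 0 and v.mother == 0


def is_inherited_homozygous(v: TrioVariant) -> bool:
    # The child is homozygous alternate and both parents are heterozygous carriers
    # (a simplifying assumption: a recessive model with unaffected parents).
    return v.child == 2 and v.father == 1 and v.mother == 1


def compound_het_genes(variants: Iterable[TrioVariant]) -> Dict[str, List[str]]:
    """Return genes in which the child has at least one heterozygous variant
    inherited only from the father and at least one inherited only from the mother."""
    paternal: Dict[str, List[str]] = defaultdict(list)
    maternal: Dict[str, List[str]] = defaultdict(list)
    for v in variants:
        if v.child != 1:
            continue
        if v.father == 1 and v.mother == 0:
            paternal[v.gene].append(v.variant_id)
        elif v.mother == 1 and v.father == 0:
            maternal[v.gene].append(v.variant_id)
    # Keep only genes hit on both the paternal and the maternal haplotype.
    return {g: sorted(paternal[g] + maternal[g]) for g in paternal if g in maternal}
```

In a real analysis each candidate would also need adequate read coverage in every family member before being trusted, which is presumably why per-site coverage from BAM files is one of the inputs listed in METHODS.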

Grants

  1. R01 HG006465/NHGRI NIH HHS
  2. UL1 TR000427/NCATS NIH HHS

MeSH Terms

Datasets as Topic
Exome
Genome, Human
Genomics
Humans
Mutation
Sequence Analysis, DNA
Software
