DECA: scalable XHMM exome copy-number variant calling with ADAM and Apache Spark.

Michael D Linderman, Davin Chia, Forrest Wallace, Frank A Nothaft
Author Information
  1. Michael D Linderman: Department of Computer Science, Middlebury College, 75 Shannon St, Middlebury, VT, 05753, USA. mlinderman@middlebury.edu.
  2. Davin Chia: Department of Computer Science, Middlebury College, 75 Shannon St, Middlebury, VT, 05753, USA.
  3. Forrest Wallace: Department of Computer Science, Middlebury College, 75 Shannon St, Middlebury, VT, 05753, USA.
  4. Frank A Nothaft: AMPLab, University of California, Berkeley, Berkeley, CA, USA.

Abstract

BACKGROUND: XHMM is a widely used tool for copy-number variant (CNV) discovery from whole exome sequencing data but can require hours to days to run for large cohorts. A more scalable implementation would reduce the need for specialized computational resources and enable increased exploration of the configuration parameter space to obtain the best possible results.
RESULTS: DECA is a horizontally scalable implementation of the XHMM algorithm using the ADAM framework and Apache Spark that incorporates novel algorithmic optimizations to eliminate unneeded computation. DECA parallelizes XHMM on both multi-core shared memory computers and large shared-nothing Spark clusters. We performed CNV discovery from the read-depth matrix in 2535 exomes in 9.3 min on a 16-core workstation (35.3× speedup vs. XHMM), 12.7 min using 10 executor cores on a Spark cluster (18.8× speedup vs. XHMM), and 9.8 min using 32 executor cores on Amazon AWS' Elastic MapReduce. We performed CNV discovery from the original BAM files in 292 min using 640 executor cores on a Spark cluster.
CONCLUSIONS: We describe DECA's performance, the algorithmic and implementation enhancements to XHMM that achieve that performance, and the lessons we learned porting a complex genome analysis application to ADAM and Spark. ADAM and Apache Spark are a performant and productive platform for implementing large-scale genome analyses, but efficiently utilizing large clusters can require algorithmic optimizations and careful attention to Spark's configuration parameters.
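
As a rough sanity check on the speedups reported in RESULTS, the XHMM runtime implied by each figure can be back-calculated from speedup = baseline / DECA runtime. The sketch below is illustrative only: the function name `implied_baseline_minutes` is hypothetical, the implied baselines are derived values rather than measurements stated in the abstract, and the two baselines need not agree because they were presumably measured on different hardware.

```python
# Reported DECA runtimes (minutes) and speedups vs. XHMM, copied from the abstract.
# The 32-core AWS run is omitted because the abstract gives no speedup for it.
reported = {
    "16-core workstation": (9.3, 35.3),
    "10 executor cores, Spark cluster": (12.7, 18.8),
}


def implied_baseline_minutes(deca_minutes, speedup):
    """Return the baseline runtime implied by speedup = baseline / deca_minutes."""
    return deca_minutes * speedup


for platform, (minutes, speedup) in reported.items():
    baseline = implied_baseline_minutes(minutes, speedup)
    # Derived estimate only; not a figure reported by the authors.
    print(f"{platform}: about {baseline:.0f} min ({baseline / 60:.1f} h) implied for XHMM")
```

Both implied baselines come out at several hours, which is in line with the BACKGROUND statement that XHMM can require hours to days for large cohorts.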

Grants

  1. CCF-1139158/National Science Foundation (US)
  2. 7076018/Lawrence Berkeley National Laboratory (US)
  3. FA8750-12-2-0331/Defense Advanced Research Projects Agency
  4. U54HG007990-01/NHGRI NIH HHS
  5. HHSN261201400006C/NIH HHS

MeSH Term

Algorithms
DNA Copy Number Variations
Exome
High-Throughput Nucleotide Sequencing
Exome Sequencing
