ADS-HCSpark: A scalable HaplotypeCaller leveraging adaptive data segmentation to accelerate variant calling on Spark.

Anghong Xiao, Zongze Wu, Shoubin Dong
Author Information
  1. Anghong Xiao: Communication & Computer Network Lab of Guangdong, School of Computer Science & Engineering, South China University of Technology, Wushan Road, Guangzhou, 510641, China.
  2. Zongze Wu: Communication & Computer Network Lab of Guangdong, School of Computer Science & Engineering, South China University of Technology, Wushan Road, Guangzhou, 510641, China.
  3. Shoubin Dong: Communication & Computer Network Lab of Guangdong, School of Computer Science & Engineering, South China University of Technology, Wushan Road, Guangzhou, 510641, China. sbdong@scut.edu.cn.

Abstract

BACKGROUND: Advances in next-generation sequencing enable higher throughput at lower cost. As the foundation of high-throughput sequencing data analysis, variant calling is widely used in disease research, clinical treatment and medical research. However, current mainstream variant callers suffer from serious computation bottlenecks, which produce long-tail tasks when they run on large datasets. This prevents high scalability on multi-node, multi-core clusters and leads to long runtimes and inefficient use of computing resources. A highly scalable tool that runs in a distributed environment would therefore be highly useful for accelerating variant calling on large-scale genome data.
RESULTS: In this paper, we present ADS-HCSpark, a scalable variant calling tool based on the Apache Spark framework. ADS-HCSpark accelerates variant calling by parallelizing the mainstream GATK HaplotypeCaller algorithm across multiple cores and multiple nodes. To address computation skew in HaplotypeCaller, we propose a parallel strategy of adaptive data segmentation and implement a variant calling algorithm based on it, which achieves good scalability in both single-node and multi-node settings. Because adjacent data blocks must have overlapping boundaries, the Hadoop-BAM library is customized to partition a BAM file into overlapping blocks, which further improves the accuracy of variant calling.
CONCLUSIONS: ADS-HCSpark is a scalable variant calling tool based on the Apache Spark framework that parallelizes the GATK HaplotypeCaller algorithm. We evaluated ADS-HCSpark on our cluster. At the best performance achievable on this experimental platform, ADS-HCSpark is 74% faster than GATK3.8 HaplotypeCaller in single-node experiments, and 57% faster than GATK4.0 HaplotypeCallerSpark and 27% faster than SparkGA in multi-node experiments, with better scalability and an accuracy of over 99%. The source code of ADS-HCSpark is publicly available at https://github.com/SCUT-CCNL/ADS-HCSpark.git.
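
ILLUSTRATION: The abstract names two ideas, overlapped data blocks and adaptive data segmentation, without giving their details. The Python sketch below is only a toy illustration of those two ideas on a one-dimensional coordinate range. It is not the ADS-HCSpark implementation and does not use Spark, GATK or Hadoop-BAM. The names Block, make_block, overlapped_blocks and split_heavy, the halving rule, and the depth-based cost estimate are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Block:
    start: int         # first position the block is responsible for (inclusive)
    end: int           # end of the responsible range (exclusive)
    padded_start: int  # start of the data actually read, including overlap
    padded_end: int    # end of the data actually read, including overlap


def make_block(start: int, end: int, overlap: int, length: int) -> Block:
    """Build a block whose read range extends `overlap` positions past each edge."""
    return Block(start, end, max(0, start - overlap), min(length, end + overlap))


def overlapped_blocks(length: int, block_size: int, overlap: int) -> List[Block]:
    """Cut [0, length) into fixed-size blocks that overlap their neighbours."""
    if length < 0 or block_size <= 0 or overlap < 0:
        raise ValueError("length and overlap must be >= 0, block_size must be > 0")
    return [make_block(s, min(s + block_size, length), overlap, length)
            for s in range(0, length, block_size)]


def split_heavy(block: Block, cost: Callable[[Block], float], max_cost: float,
                overlap: int, length: int, min_size: int) -> List[Block]:
    """Halve a block recursively while its estimated cost exceeds max_cost.

    The halving rule is an assumption of this sketch, chosen for simplicity.
    """
    size = block.end - block.start
    if cost(block) <= max_cost or size < 2 * min_size:
        return [block]
    mid = block.start + size // 2
    pieces: List[Block] = []
    for half in (make_block(block.start, mid, overlap, length),
                 make_block(mid, block.end, overlap, length)):
        pieces.extend(split_heavy(half, cost, max_cost, overlap, length, min_size))
    return pieces


# Hypothetical workload: positions 400..799 are ten times deeper than the rest.
depth = [1] * 1000
depth[400:800] = [10] * 400


def estimated_cost(block: Block) -> float:
    """Toy cost model (an assumption): total depth inside the responsible range."""
    return sum(depth[block.start:block.end])


blocks = [piece
          for b in overlapped_blocks(1000, 400, 50)
          for piece in split_heavy(b, estimated_cost, 1000, 50, 1000, 50)]
for b in blocks:
    print(b)
```

In this toy run only the deep middle block is subdivided, so the resulting tasks have more even estimated cost, while each block still reads a margin of neighbouring data around the range it is responsible for.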

Grants

  1. 2015A030308017/Natural Science Foundation of Guangdong Province

MeSH Terms

Algorithms
Databases, Genetic
Genetic Variation
Genome
Haplotypes
High-Throughput Nucleotide Sequencing
Humans
Sequence Analysis, DNA
Software
Time Factors
