A comparison of three programming languages for a full-fledged next-generation sequencing tool.

Pascal Costanza, Charlotte Herzeel, Wilfried Verachtert
Author Information
  1. Pascal Costanza: imec, ExaScience Lab, Kapeldreef 75, Leuven, 3001, Belgium.
  2. Charlotte Herzeel: imec, ExaScience Lab, Kapeldreef 75, Leuven, 3001, Belgium.
  3. Wilfried Verachtert: imec, ExaScience Lab, Kapeldreef 75, Leuven, 3001, Belgium.

Abstract

BACKGROUND: elPrep is an established multi-threaded framework for preparing SAM and BAM files in sequencing pipelines. To achieve good performance, its software architecture makes only a single pass through a SAM/BAM file for multiple preparation steps and keeps as much sequencing data as possible in main memory. As in other SAM/BAM tools, heap memory management is a complex task in elPrep, and during recent development it became a serious productivity bottleneck in the original implementation language. We therefore investigated three alternative programming languages for handling large numbers of heap objects: Go and Java, which both use a concurrent, parallel garbage collector, and C++17, which uses reference counting. We reimplemented elPrep in all three languages and benchmarked their runtime performance and memory use.
RESULTS: The Go implementation offers the best balance between runtime performance and memory use. While the Java benchmarks report a somewhat faster runtime than the Go benchmarks, the memory use of the Java runs is significantly higher. The C++17 benchmarks run significantly slower than both Go and Java, while using somewhat more memory than the Go runs. Our analysis shows that, in our case, concurrent, parallel garbage collection manages a large heap of objects better than reference counting does.
CONCLUSIONS: Based on our benchmark results, we selected Go as our new implementation language for elPrep, and recommend considering Go as a good candidate for developing other bioinformatics tools for processing SAM/BAM data as well.

MeSH Terms

Benchmarking
High-Throughput Nucleotide Sequencing
Humans
Programming Languages
Software
Time Factors

