HAlign-II: efficient ultra-large multiple sequence alignment and phylogenetic tree reconstruction with distributed and parallel computing.

Shixiang Wan, Quan Zou
Author Information
  1. Shixiang Wan: School of Computer Science and Technology, Tianjin University, Tianjin, China.
  2. Quan Zou: School of Computer Science and Technology, Tianjin University, Tianjin, China.

Abstract

BACKGROUND: Multiple sequence alignment (MSA) plays a key role in biological sequence analyses, especially in phylogenetic tree construction. The rapid growth of next-generation sequencing data has exposed a shortage of efficient alignment approaches that can handle ultra-large sets of biological sequences of different types.
METHODS: Distributed and parallel computing is a crucial technique for accelerating analyses of ultra-large sequence files (e.g. files larger than 1 GB). Building on HAlign and the Spark distributed computing system, we implemented HAlign-II, a highly cost-efficient and time-efficient tool for ultra-large multiple biological sequence alignment and phylogenetic tree construction.
RESULTS: Experiments on large-scale DNA and protein data sets, each larger than 1 GB, showed that HAlign-II saves both time and space and outperforms current software tools. HAlign-II can efficiently carry out MSA and construct phylogenetic trees with ultra-large numbers of biological sequences. It shows extremely high memory efficiency and scales well with increases in computing resources.
CONCLUSIONS: HAlign-II provides a user-friendly web server based on our distributed computing infrastructure. HAlign-II, with open-source code and datasets, is available at http://lab.malab.cn/soft/halign.
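
To make the idea of distributing MSA work concrete, here is a minimal Python sketch. The abstract does not describe the algorithm HAlign-II uses, so everything below is an assumption chosen for illustration: a center-star strategy with the first sequence as the center, simple Needleman-Wunsch scoring, and hypothetical helper names (`needleman_wunsch`, `gap_profile`, `expand`, `center_star_msa`). The point it shows is that the pairwise alignments against the center are independent of each other, so they can be handed to any parallel `map`; a thread pool stands in here for a cluster-level map such as one provided by Spark.

```python
from concurrent.futures import ThreadPoolExecutor


def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global pairwise alignment; returns (aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b, i, j = [], [], n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def gap_profile(center_aln):
    """gaps[k] = gap columns before center residue k; last entry = trailing gaps."""
    gaps, run = [], 0
    for ch in center_aln:
        if ch == "-":
            run += 1
        else:
            gaps.append(run)
            run = 0
    gaps.append(run)
    return gaps


def expand(center_aln, other_aln, master):
    """Pad one pairwise alignment row so it fits the shared (master) gap profile."""
    own = gap_profile(center_aln)
    out, col = [], 0
    for k in range(len(master)):
        out.append("-" * (master[k] - own[k]))      # gaps forced by other sequences
        out.append(other_aln[col:col + own[k]])     # residues in this row's own insert
        col += own[k]
        if k < len(master) - 1:                     # column holding center residue k
            out.append(other_aln[col])
            col += 1
    return "".join(out)


def center_star_msa(seqs, map_fn=map):
    """Center-star MSA; map_fn is the (possibly parallel) map over sequences."""
    center = seqs[0]  # simplification: take the first sequence as the center
    pairs = list(map_fn(lambda s: needleman_wunsch(center, s), seqs[1:]))
    profiles = [gap_profile(c) for c, _ in pairs]
    master = ([max(col) for col in zip(*profiles)]
              if profiles else [0] * (len(center) + 1))
    rows = [expand(center, center, master)]
    rows += [expand(c, o, master) for c, o in pairs]
    return rows


example = ["ACGTACGT", "ACGACGT", "ACGTTACGT", "CGTACG"]
with ThreadPoolExecutor(max_workers=4) as pool:
    for row in center_star_msa(example, map_fn=pool.map):
        print(row)
```

Passing the map function in as a parameter keeps the alignment logic separate from the execution backend. Note that threads in CPython will not speed up this CPU-bound work; the pool only illustrates where a distributed map would plug in.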
