Density-based hierarchical clustering of pyro-sequences on a large scale--the case of fungal ITS1.

Marco Pagni, Hélène Niculita-Hirzel, Loïc Pellissier, Anne Dubuis, Ioannis Xenarios, Antoine Guisan, Ian R Sanders, Jérôme Goudet, Nicolas Guex
Author Information
  1. Marco Pagni: Vital-IT Group, SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland.

Abstract

MOTIVATION: Analysis of millions of pyro-sequences is currently playing a crucial role in the advance of environmental microbiology. Taxonomy-independent, i.e. unsupervised, clustering of these sequences is essential for the definition of Operational Taxonomic Units. For this application, reproducibility and robustness should be the most sought-after qualities, but they have thus far largely been overlooked.
RESULTS: More than 1 million hyper-variable internal transcribed spacer 1 (ITS1) sequences of fungal origin have been analyzed. The ITS1 sequences were first properly extracted from 454 reads using generalized profiles. Then, otupipe, cd-hit-454, ESPRIT-Tree and DBC454, a new algorithm presented here, were used to analyze the sequences. A numerical assay was developed to measure the reproducibility and robustness of these algorithms. DBC454 was the most robust, closely followed by ESPRIT-Tree. DBC454 features density-based hierarchical clustering, which complements the other methods by providing insights into the structure of the data.
AVAILABILITY: An executable is freely available for non-commercial users at ftp://ftp.vital-it.ch/tools/dbc454. It is designed to run under MPI on a cluster of 64-bit Linux machines running Red Hat 4.x, or on a multi-core OSX system.
CONTACT: dbc454@vital-it.ch or nicolas.guex@isb-sib.ch.

MeSH Terms

Algorithms
Cluster Analysis
DNA, Fungal
DNA, Ribosomal Spacer
Fungi
Reproducibility of Results
Soil Microbiology

Chemicals

DNA, Fungal
DNA, Ribosomal Spacer
