Large scale hierarchical clustering of protein sequences.

Antje Krause, Jens Stoye, Martin Vingron
Author Information
  1. Antje Krause: Max Planck Institute for Molecular Genetics, Computational Molecular Biology, Ihnestrasse 73, 14195 Berlin, Germany. akrause@igw.tfh-wildau.de

Abstract

BACKGROUND: Searching a biological sequence database with a query sequence to find homologues has become a routine operation in computational biology. Despite the high degree of sophistication of currently available search routines, it is still virtually impossible to identify quickly and clearly the group of sequences to which a given query sequence belongs.
RESULTS: We report on our developments in grouping all known protein sequences hierarchically into superfamily and family clusters. Our graph-based algorithms take into account the topology of the sequence space induced by the data itself to construct a biologically meaningful partitioning. We have applied our clustering procedures to a non-redundant set of about 1,000,000 sequences, resulting in a hierarchical clustering that is being made available for querying and browsing at http://systers.molgen.mpg.de/.
CONCLUSIONS: Comparisons with other widely used clustering methods on various data sets show the abilities and strengths of our clustering methods in producing a biologically meaningful grouping of protein sequences.
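
The idea of a two-level, graph-based grouping can be illustrated with a minimal sketch. This is not the authors' algorithm: the abstract states that their methods use the topology of the sequence space, whereas the sketch below uses plain single-linkage clustering with two fixed E-value cutoffs. The function names, the cutoffs (`loose`, `strict`), and the toy similarity edges are all assumptions made for illustration only.

```python
from collections import defaultdict


def connected_components(nodes, edges, max_evalue):
    """Single-linkage clusters: two sequences are linked when an edge
    between them has an E-value <= max_evalue (illustrative cutoff)."""
    parent = {n: n for n in nodes}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, evalue in edges:
        if evalue <= max_evalue:
            ra, rb = find(a), find(b)
            if ra != rb:
                parent[ra] = rb

    groups = defaultdict(set)
    for n in nodes:
        groups[find(n)].add(n)
    # Sort clusters by their sorted member lists for a stable order.
    return sorted((frozenset(g) for g in groups.values()), key=sorted)


def two_level_hierarchy(nodes, edges, loose, strict):
    """Map each loose-cutoff cluster ("superfamily") to the strict-cutoff
    clusters ("families") found among its own members."""
    hierarchy = {}
    for superfamily in connected_components(nodes, edges, loose):
        inner = [(a, b, e) for a, b, e in edges
                 if a in superfamily and b in superfamily]
        hierarchy[superfamily] = connected_components(
            sorted(superfamily), inner, strict)
    return hierarchy


# Hypothetical pairwise hits: (sequence, sequence, E-value).
nodes = ["A", "B", "C", "D", "E", "F"]
edges = [
    ("A", "B", 1e-50),
    ("B", "C", 1e-40),
    ("C", "D", 1e-8),   # weak link: joins superfamilies, not families
    ("D", "E", 1e-60),
]
hierarchy = two_level_hierarchy(nodes, edges, loose=1e-5, strict=1e-30)
for superfamily, families in hierarchy.items():
    print(sorted(superfamily), [sorted(f) for f in families])
```

With these toy values, the weak C-D hit merges A through E into one superfamily under the loose cutoff, while the strict cutoff splits that superfamily into two families; F, which has no hits, remains a singleton at both levels. Fixed cutoffs of this kind are exactly what a topology-aware method would aim to avoid, so the sketch serves only to show the nesting of family clusters inside superfamily clusters.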

MeSH Term

Algorithms
Cluster Analysis
Computational Biology
Databases, Factual
Databases, Genetic
Databases, Nucleic Acid
Databases, Protein
Fungal Proteins
Genetic Linkage
Genome
Information Storage and Retrieval
Models, Biological
Multigene Family
Phylogeny
Protein Structure, Tertiary
Proteins
Proteomics
Reproducibility of Results
Sequence Alignment
Sequence Analysis, Protein
Software

Chemicals

Fungal Proteins
Proteins

Links to CNCB-NGDC Resources

Database Commons: DBC001151 (SYSTERS)
