Large scale hierarchical clustering of protein sequences.

Antje Krause, Jens Stoye, Martin Vingron

Author Information

Antje Krause: Max Planck Institute for Molecular Genetics, Computational Molecular Biology, Ihnestrasse 73, 14195 Berlin, Germany. akrause@igw.tfh-wildau.de

PMID: 15663796 DOI: 10.1186/1471-2105-6-15

BACKGROUND: Searching a biological sequence database for homologues of a query sequence has become a routine operation in computational biology. Despite the high degree of sophistication of currently available search routines, it is still virtually impossible to identify quickly and clearly the group of sequences to which a given query sequence belongs.
RESULTS: We report on our developments in grouping all known protein sequences hierarchically into superfamily and family clusters. Our graph-based algorithms take into account the topology of the sequence space induced by the data itself to construct a biologically meaningful partitioning. We have applied our clustering procedures to a non-redundant set of about 1,000,000 sequences, resulting in a hierarchical clustering that is being made available for querying and browsing at http://systers.molgen.mpg.de/.
CONCLUSIONS: Comparisons with other widely used clustering methods on various data sets show the abilities and strengths of our clustering methods in producing a biologically meaningful grouping of protein sequences.
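
The idea of a graph-based, two-level grouping into superfamilies and families can be illustrated with a minimal sketch. This is not the authors' algorithm: it swaps in plain single-linkage (connected components) at two fixed score cutoffs, whereas the abstract describes methods that adapt to the topology of the sequence space. The function names, the toy sequence labels, the similarity scores, and the `loose` and `strict` thresholds are all assumptions made for illustration.

```python
from collections import defaultdict


def connected_components(nodes, edges, threshold):
    """Group nodes joined by edges scoring >= threshold (union-find)."""
    parent = {n: n for n in nodes}

    def find(x):
        # Follow parent links to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, score in edges:
        if score >= threshold:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_a] = root_b

    groups = defaultdict(set)
    for n in nodes:
        groups[find(n)].add(n)
    # Sort members and groups so the output is deterministic.
    return sorted(sorted(g) for g in groups.values())


def two_level_clustering(nodes, edges, loose, strict):
    """Split nodes into loose clusters, then each into stricter subclusters."""
    result = []
    for superfamily in connected_components(nodes, edges, loose):
        members = set(superfamily)
        inner = [(a, b, s) for a, b, s in edges if a in members and b in members]
        families = connected_components(superfamily, inner, strict)
        result.append({"superfamily": superfamily, "families": families})
    return result


# Hypothetical toy data: (sequence, sequence, pairwise similarity score).
nodes = ["A", "B", "C", "D", "E"]
edges = [("A", "B", 90), ("B", "C", 40), ("D", "E", 85)]
clusters = two_level_clustering(nodes, edges, loose=30, strict=80)
for c in clusters:
    print(c["superfamily"], "->", c["families"])
```

With these toy scores, A, B and C fall into one loose cluster that splits into the families {A, B} and {C} at the stricter cutoff, while D and E stay together at both levels. Fixed global cutoffs are the simplest possible choice and are exactly what a data-driven method would aim to avoid.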

Algorithms

Cluster Analysis

Computational Biology

Databases, Factual

Databases, Genetic

Databases, Nucleic Acid

Databases, Protein

Fungal Proteins

Genetic Linkage

Genome

Information Storage and Retrieval

Models, Biological

Multigene Family

Phylogeny

Protein Structure, Tertiary

Proteins

Proteomics

Reproducibility of Results

Sequence Alignment

Sequence Analysis, Protein

Software

Journal Article; Research Support, Non-U.S. Gov't

Database Commons: DBC001151 (SYSTERS)
