BACKGROUND: Searching a biological sequence database with a query sequence to find homologues has become a routine operation in computational biology. Despite the sophistication of currently available search routines, it is still virtually impossible to identify quickly and clearly the group of sequences to which a given query sequence belongs.
RESULTS: We report on our developments in grouping all known protein sequences hierarchically into superfamily and family clusters. Our graph-based algorithms take into account the topology of the sequence space induced by the data itself to construct a biologically meaningful partitioning. We have applied our clustering procedures to a non-redundant set of about 1,000,000 sequences, resulting in a hierarchical clustering that is being made available for querying and browsing at http://systers.molgen.mpg.de/.
CONCLUSIONS: Comparisons with other widely used clustering methods on various data sets show the abilities and strengths of our clustering methods in producing a biologically meaningful grouping of protein sequences.