Accelerated similarity searching and clustering of large compound sets by geometric embedding and locality sensitive hashing.

Yiqun Cao, Tao Jiang, Thomas Girke
Author Information
  1. Yiqun Cao: Department of Computer Science & Engineering and Department of Botany & Plant Sciences, University of California, Riverside, CA 92521, USA.

Abstract

MOTIVATION: Similarity searching and clustering of chemical compounds by structural similarities are important computational approaches for identifying drug-like small molecules. Most algorithms available for these tasks are limited by their speed and scalability, and cannot handle today's large compound databases with several million entries.
RESULTS: In this article, we introduce a new algorithm for accelerated similarity searching and clustering of very large compound sets using embedding and indexing (EI) techniques. First, we present EI-Search as a general purpose similarity search method for finding objects with similar features in large databases and apply it here to searching and clustering of large compound sets. The method embeds the compounds in a high-dimensional Euclidean space and searches this space using an efficient index-aware nearest neighbor search method based on locality sensitive hashing (LSH). Second, to cluster large compound sets, we introduce the EI-Clustering algorithm that combines the EI-Search method with Jarvis-Patrick clustering. Both methods were tested on three large datasets with sizes ranging from about 260 000 to over 19 million compounds. In comparison to sequential search methods, the EI-Search method was 40-200 times faster, while maintaining comparable recall rates. The EI-Clustering method allowed us to significantly reduce the CPU time required to cluster these large compound libraries from several months to only a few days.
AVAILABILITY: Software implementations and online services have been developed based on the methods introduced in this study. The online services provide access to the generated clustering results and ultra-fast similarity searching of the PubChem Compound database with subsecond response time.

References

J Chem Inf Model. 2006 Jan-Feb;46(1):201-7 [PMID: 16426056]
Plant Physiol. 2005 Jun;138(2):573-7 [PMID: 15955920]
Bioinformatics. 2007 Sep 1;23(17):2348-51 [PMID: 17599932]
Nat Biotechnol. 2007 Jan;25(1):71-5 [PMID: 17211405]
J Chem Inf Comput Sci. 2003 Nov-Dec;43(6):1933-41 [PMID: 14632443]
J Chem Inf Comput Sci. 2002 Jan-Feb;42(1):46-57 [PMID: 11855965]
Curr Opin Chem Biol. 2002 Jun;6(3):384-9 [PMID: 12023120]
Science. 2004 Nov 12;306(5699):1138-9 [PMID: 15542455]
Science. 2000 Dec 22;290(5500):2319-23 [PMID: 11125149]
J Mol Graph Model. 2003 Mar;21(5):421-33 [PMID: 12543138]
Nat Chem Biol. 2007 Aug;3(8):447-50 [PMID: 17637771]
Bioinformatics. 2008 Jul 1;24(13):i366-74 [PMID: 18586736]
J Med Chem. 2005 Jun 30;48(13):4183-99 [PMID: 15974568]
J Chem Inf Model. 2005 Jan-Feb;45(1):177-82 [PMID: 15667143]
J Chem Inf Model. 2007 Mar-Apr;47(2):302-17 [PMID: 17326616]
Curr Opin Chem Biol. 2004 Aug;8(4):412-7 [PMID: 15288252]
Science. 2003 Apr 11;300(5617):294-5 [PMID: 12690189]
Proc Natl Acad Sci U S A. 2002 Dec 10;99(25):15869-72 [PMID: 12444256]
J Chem Inf Model. 2008 Jul;48(7):1367-78 [PMID: 18593143]
Nucleic Acids Res. 2008 Jan;36(Database issue):D351-9 [PMID: 17947324]
Science. 2000 Dec 22;290(5500):2323-6 [PMID: 11125150]
Drug Discov Today. 2002 Sep 1;7(17):903-11 [PMID: 12546933]
J Comput Chem. 2003 Jul 30;24(10):1215-21 [PMID: 12820129]
J Chem Inf Comput Sci. 2002 Nov-Dec;42(6):1407-14 [PMID: 12444738]
Curr Opin Chem Biol. 2005 Jun;9(3):296-303 [PMID: 15939332]

MeSH Term

Cluster Analysis
Computational Biology
Information Storage and Retrieval
Pattern Recognition, Automated

Word Cloud

Similar Articles

Cited By