Parallel clustering algorithm for large-scale biological data sets. - National Genomics Data Center (CNCB-NGDC)

Advanced Search

Parallel clustering algorithm for large-scale biological data sets.

Minchao Wang, Wu Zhang, Wang Ding, Dongbo Dai, Huiran Zhang, Hao Xie, Luonan Chen, Yike Guo, Jiang Xie

Author Information

Minchao Wang: School of Computer Engineering and Science, Shanghai University, Shanghai, P.R.China.
Wu Zhang: School of Computer Engineering and Science, Shanghai University, Shanghai, P.R.China; High Performance Computing Center, Shanghai University, Shanghai, P.R.China.
Wang Ding: School of Computer Engineering and Science, Shanghai University, Shanghai, P.R.China.
Dongbo Dai: School of Computer Engineering and Science, Shanghai University, Shanghai, P.R.China.
Huiran Zhang: School of Computer Engineering and Science, Shanghai University, Shanghai, P.R.China.
Hao Xie: College of Stomatology, Wuhan University, Wuhan, P.R.China.
Luonan Chen: School of Computer Engineering and Science, Shanghai University, Shanghai, P.R.China; Key Laboratory of Systems Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, P.R.China.
Yike Guo: School of Computer Engineering and Science, Shanghai University, Shanghai, P.R.China; Department of Computing, Imperial College London, London, United Kingdom.
Jiang Xie: School of Computer Engineering and Science, Shanghai University, Shanghai, P.R.China.

PMID: 24705246 DOI: 10.1371/journal.pone.0091315

BACKGROUNDS: Recent explosion of biological data brings a great challenge for the traditional clustering algorithms. With increasing scale of data sets, much larger memory and longer runtime are required for the cluster identification problems. The affinity propagation algorithm outperforms many other classical clustering algorithms and is widely applied into the biological researches. However, the time and space complexity become a great bottleneck when handling the large-scale data sets. Moreover, the similarity matrix, whose constructing procedure takes long runtime, is required before running the affinity propagation algorithm, since the algorithm clusters data sets based on the similarities between data pairs.
METHODS: Two types of parallel architectures are proposed in this paper to accelerate the similarity matrix constructing procedure and the affinity propagation algorithm. The memory-shared architecture is used to construct the similarity matrix, and the distributed system is taken for the affinity propagation algorithm, because of its large memory size and great computing capacity. An appropriate way of data partition and reduction is designed in our method, in order to minimize the global communication cost among processes.
RESULT: A speedup of 100 is gained with 128 cores. The runtime is reduced from serval hours to a few seconds, which indicates that parallel algorithm is capable of handling large-scale data sets effectively. The parallel affinity propagation also achieves a good performance when clustering large-scale gene data (microarray) and detecting families in large protein superfamilies.

Comput Oper Res. 2012 Dec;39(12):3046-3061 [PMID: 23144527]
BMC Bioinformatics. 2006 Nov 06;7:488 [PMID: 17087821]
PLoS One. 2008 Feb 27;3(2):e1662 [PMID: 18301740]
Nucleic Acids Res. 2006 Mar 17;34(5):1571-80 [PMID: 16547200]
BMC Bioinformatics. 2011 Jun 02;12:224 [PMID: 21635750]
Nucleic Acids Res. 2002 Apr 1;30(7):1575-84 [PMID: 11917018]
BMC Syst Biol. 2010 May 11;4:59 [PMID: 20459825]
Bioinformatics. 2002;18 Suppl 1:S145-54 [PMID: 12169542]
Bioinformatics. 2003 May 22;19(8):973-80 [PMID: 12761060]
Science. 2007 Feb 16;315(5814):972-6 [PMID: 17218491]
IEEE/ACM Trans Comput Biol Bioinform. 2012 May-Jun;9(3):679-92 [PMID: 21483031]
Nature. 2005 Feb 24;433(7028):895-900 [PMID: 15729348]
Fed Proc. 1976 Aug;35(10):2132-8 [PMID: 181273]
Nucleic Acids Res. 2002 Jan 1;30(1):207-10 [PMID: 11752295]
Int J Knowl Eng Soft Data Paradig. 2009 Jan 1;1(1):40-48 [PMID: 20057919]
Biochemistry. 2005 May 3;44(17):6383-91 [PMID: 15850372]
Biochemistry. 2006 Feb 28;45(8):2545-55 [PMID: 16489747]
Bioinformatics. 2007 Oct 15;23(20):2708-15 [PMID: 17895277]
Nat Genet. 2000 May;25(1):25-9 [PMID: 10802651]
Phys Rev Lett. 1996 Apr 29;76(18):3251-3254 [PMID: 10060920]
Bioinformatics. 2002 Jan;18(1):207-8 [PMID: 11836235]
Biochemistry. 1996 Dec 24;35(51):16489-501 [PMID: 8987982]
Bioinformatics. 2009 Dec 1;25(23):3143-50 [PMID: 19770263]
Placenta. 2011 Oct;32(10):763-70 [PMID: 21803418]
J Immunol. 2004 Dec 15;173(12):7141-9 [PMID: 15585835]
PLoS One. 2013 Apr 23;8(4):e61533 [PMID: 23626696]
J Mol Biol. 1999 Apr 23;288(1):147-64 [PMID: 10329133]
Bioinformatics. 2010 Feb 1;26(3):355-62 [PMID: 19996165]
Circulation. 2007 Mar 27;115(12):1551-62 [PMID: 17353439]
IEEE Trans Pattern Anal Mach Intell. 2011 Mar;33(3):568-86 [PMID: 20421667]
FEBS Lett. 2002 Jul 3;522(1-3):24-8 [PMID: 12095613]
Comput Biol Med. 2008 Mar;38(3):283-93 [PMID: 18061589]
J Chem Inf Model. 2011 Dec 27;51(12):3158-68 [PMID: 22098146]
BMC Bioinformatics. 2003 Jan 13;4:2 [PMID: 12525261]

Algorithms

Cluster Analysis

Computational Biology

Databases, Protein

Datasets as Topic

Gene Expression Profiling

High-Throughput Screening Assays

Humans

Microarray Analysis

Multigene Family

Proteins

Proteomics

Proteins

Journal Article Research Support, Non-U.S. Gov't Validation Study