Sparse Markov chain-based semi-supervised multi-instance multi-label method for protein function prediction.

Chao Han, Jian Chen, Qingyao Wu, Shuai Mu, Huaqing Min

Author Information

Chao Han: School of Software Engineering, South China University of Technology, Guangzhou, P. R. China.
Jian Chen: School of Software Engineering, South China University of Technology, Guangzhou, P. R. China.
Qingyao Wu: School of Software Engineering, South China University of Technology, Guangzhou, P. R. China.
Shuai Mu: School of Software Engineering, South China University of Technology, Guangzhou, P. R. China.
Huaqing Min: School of Software Engineering, South China University of Technology, Guangzhou, P. R. China.

PMID: 26493682 DOI: 10.1142/S0219720015430015

Automated assignment of protein function has received considerable attention in recent years for genome-wide studies. With the rapid accumulation of genome sequencing data produced by high-throughput experimental techniques, manually annotating the functional properties of proteins has become increasingly cumbersome, and such large genomic data sets can only be annotated computationally. However, automatically assigning functions to proteins of unknown function is inherently difficult and complex. Previous studies have shown that the multi-instance multi-label (MIML) framework is effective for problems involving complicated objects with multiple semantic meanings. In protein function prediction, each protein may be associated with several distinct structural units and multiple functional properties, where each unit is described by an instance and each functional property is treated as a class label. The MIML framework is therefore a natural fit for the protein function prediction problem. In this paper, we propose a sparse Markov chain-based semi-supervised MIML method, called Sparse-Markov. A sparse transductive probability graph is constructed to encode the affinity information of the data, based on an ensemble of Hausdorff distance metrics. Our goal is to exploit the affinity between protein objects in this graph to find a sparse steady-state probability of the Markov chain model for protein function prediction, such that two proteins are given similar functional labels if they are close to each other in terms of the ensemble Hausdorff distance. Experimental results on seven real-world organism data sets covering three biological domains show that the proposed Sparse-Markov method achieves better performance than four state-of-the-art MIML learning algorithms.
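The general idea described in the abstract (bag-level Hausdorff distances, a sparse transition graph, and a Markov chain whose stationary scores rank labels) can be illustrated with a minimal sketch. This is not the authors' Sparse-Markov algorithm: the abstract does not give its formulation, so the sketch assumes a k-nearest-neighbour sparsification of the graph and a plain random walk with restart in place of the paper's sparse steady-state solution. All function names, the three Hausdorff variants averaged in the ensemble, and the parameters `k`, `alpha`, and `n_iter` are hypothetical choices made for illustration.

```python
import numpy as np

def hausdorff(bag_a, bag_b, mode="max"):
    """Hausdorff-style distance between two bags of shape (n_instances, d)."""
    # Euclidean distance between every instance of A and every instance of B.
    diff = bag_a[:, None, :] - bag_b[None, :, :]
    pair = np.sqrt((diff ** 2).sum(axis=2))
    a_to_b = pair.min(axis=1)  # nearest instance in B for each instance of A
    b_to_a = pair.min(axis=0)  # nearest instance in A for each instance of B
    if mode == "max":   # classical (maximal) Hausdorff distance
        return max(a_to_b.max(), b_to_a.max())
    if mode == "min":   # minimal Hausdorff distance
        return pair.min()
    if mode == "avg":   # average Hausdorff distance
        return (a_to_b.sum() + b_to_a.sum()) / (len(bag_a) + len(bag_b))
    raise ValueError(f"unknown mode: {mode}")

def ensemble_distance(bags):
    """Assumed ensemble: the mean of the three Hausdorff variants above."""
    n = len(bags)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d = np.mean([hausdorff(bags[i], bags[j], m) for m in ("max", "min", "avg")])
            dist[i, j] = dist[j, i] = d
    return dist

def sparse_transition_matrix(dist, k=1):
    """Column-stochastic matrix keeping only the k strongest neighbours (k < n)."""
    affinity = np.exp(-dist / dist[dist > 0].mean())
    np.fill_diagonal(affinity, 0.0)  # no self-transitions
    sparse = np.zeros_like(affinity)
    for j in range(dist.shape[0]):
        nearest = np.argsort(affinity[:, j])[-k:]
        sparse[nearest, j] = affinity[nearest, j]
    return sparse / sparse.sum(axis=0, keepdims=True)

def propagate_labels(P, Y, labeled, alpha=0.8, n_iter=200):
    """Random walk with restart, run once per label from the labeled bags."""
    scores = np.zeros(Y.shape, dtype=float)
    for c in range(Y.shape[1]):
        restart = np.where(labeled, Y[:, c], 0).astype(float)
        if restart.sum() == 0:
            continue  # no labeled bag carries this label
        restart /= restart.sum()
        x = restart.copy()
        for _ in range(n_iter):
            x = alpha * (P @ x) + (1 - alpha) * restart
        scores[:, c] = x
    return scores

# Toy data: four "proteins" (bags of 2-D instances) forming two clusters.
bags = [
    np.array([[0.0, 0.0], [1.0, 0.0]]),
    np.array([[0.0, 1.0], [1.0, 1.0]]),
    np.array([[10.0, 10.0], [11.0, 10.0]]),
    np.array([[10.0, 11.0]]),
]
Y = np.array([[1, 0], [0, 0], [0, 1], [0, 0]])   # label matrix; rows 1 and 3 unknown
labeled = np.array([True, False, True, False])   # which bags have known labels

P = sparse_transition_matrix(ensemble_distance(bags), k=1)
scores = propagate_labels(P, Y, labeled)
predicted = scores.argmax(axis=1)  # highest-scoring label per bag
```

In this toy setting each unlabeled bag receives the label of the labeled bag in its own cluster, which mirrors the abstract's requirement that proteins close in ensemble Hausdorff distance receive similar functional labels. A real MIML setting would keep several labels per protein by thresholding `scores` rather than taking a single argmax.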

Hausdorff distance; Markov chain; protein function prediction; multi-instance multi-label learning; semi-supervised learning

Algorithms

Animals

Computational Biology

Databases, Protein

Genome-Wide Association Study

Markov Chains

Proteins

Supervised Machine Learning

Journal Article; Research Support, Non-U.S. Gov't
