Sparse Markov chain-based semi-supervised multi-instance multi-label method for protein function prediction.

Chao Han, Jian Chen, Qingyao Wu, Shuai Mu, Huaqing Min
Author Information
  1. Chao Han: School of Software Engineering, South China University of Technology, Guangzhou, P. R. China.
  2. Jian Chen: School of Software Engineering, South China University of Technology, Guangzhou, P. R. China.
  3. Qingyao Wu: School of Software Engineering, South China University of Technology, Guangzhou, P. R. China.
  4. Shuai Mu: School of Software Engineering, South China University of Technology, Guangzhou, P. R. China.
  5. Huaqing Min: School of Software Engineering, South China University of Technology, Guangzhou, P. R. China.

Abstract

Automated protein function assignment has received considerable attention in recent years in genome-wide studies. With the rapid accumulation of genome sequencing data produced by high-throughput experimental techniques, manually determining the functional properties of proteins has become increasingly cumbersome, and such large genomic data sets can only be annotated computationally. However, automatically assigning functions to uncharacterized proteins is challenging because of the inherent difficulty and complexity of the task. Previous studies have shown that the multi-instance multi-label (MIML) framework is effective for problems involving complicated objects with multiple semantic meanings. In protein function prediction, each protein may be associated with several distinct structural units and multiple functional properties; each structural unit is described by an instance and each functional property is treated as a class label. It is therefore natural to tackle protein function prediction within the MIML framework. In this paper, we propose a sparse Markov chain-based semi-supervised MIML method, called Sparse-Markov. A sparse transductive probability graph is constructed to encode the affinity information of the data, based on an ensemble of Hausdorff distance metrics. Our goal is to exploit the affinity between protein objects in this graph to find a sparse steady-state probability of the Markov chain model for protein function prediction, such that two proteins are given similar functional labels if they are close to each other in the graph in terms of the ensemble Hausdorff distance. Experimental results on seven real-world organism data sets covering three biological domains show that the proposed Sparse-Markov method achieves better performance than four state-of-the-art MIML learning algorithms.
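
The pipeline outlined in the abstract (bag-level Hausdorff distances, a sparse transition graph, and a steady-state probability per label) can be illustrated with a minimal Python sketch. This is not the authors' Sparse-Markov formulation, which the abstract does not specify. The sketch assumes that the ensemble is a plain average of the maximal, minimal, and average Hausdorff distances, that the graph is sparsified by keeping the `k` largest affinities per column, and that the steady state is that of a random walk with restart. The function names and the parameters `k`, `sigma`, and `alpha` are hypothetical, as is the toy data.

```python
import numpy as np

def hausdorff(bag_a, bag_b, kind="max"):
    """Hausdorff-style distance between two bags of shape (n_instances, n_features)."""
    # Pairwise Euclidean distances between every instance of bag_a and bag_b.
    d = np.linalg.norm(bag_a[:, None, :] - bag_b[None, :, :], axis=2)
    a_to_b = d.min(axis=1)  # distance from each instance of bag_a to its nearest in bag_b
    b_to_a = d.min(axis=0)  # distance from each instance of bag_b to its nearest in bag_a
    if kind == "max":
        return max(a_to_b.max(), b_to_a.max())
    if kind == "min":
        return d.min()
    if kind == "avg":
        return (a_to_b.sum() + b_to_a.sum()) / (len(a_to_b) + len(b_to_a))
    raise ValueError(f"unknown kind: {kind}")

def transition_matrix(bags, k=2, sigma=1.0, kinds=("max", "min", "avg")):
    """Column-stochastic sparse transition matrix over bags (assumed construction)."""
    n = len(bags)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            # Assumption: the ensemble distance is the mean of the three variants.
            dist[i, j] = dist[j, i] = np.mean(
                [hausdorff(bags[i], bags[j], kd) for kd in kinds]
            )
    affinity = np.exp(-dist / sigma)
    np.fill_diagonal(affinity, 0.0)  # no self-transitions
    # Assumption: sparsify by keeping only the k largest affinities in each column.
    for j in range(n):
        drop = np.argsort(affinity[:, j])[:-k]
        affinity[drop, j] = 0.0
    return affinity / affinity.sum(axis=0, keepdims=True)

def steady_state(P, restart, alpha=0.8, tol=1e-10, max_iter=1000):
    """Stationary distribution of a random walk with restart (assumed model)."""
    x = restart.copy()
    for _ in range(max_iter):
        x_new = alpha * (P @ x) + (1 - alpha) * restart
        if np.abs(x_new - x).sum() < tol:
            return x_new
        x = x_new
    return x

# Toy data: four "proteins", each a bag of two 2-D instances, in two clear groups.
bags = [
    np.array([[0.0, 0.0], [0.0, 1.0]]),
    np.array([[0.1, 0.0], [0.0, 0.9]]),
    np.array([[5.0, 5.0], [5.0, 6.0]]),
    np.array([[5.1, 5.0], [5.0, 5.9]]),
]
P = transition_matrix(bags, k=2, sigma=1.0)

# Proteins 0 and 2 are labelled (functions A and B); proteins 1 and 3 are unlabelled.
scores_a = steady_state(P, np.array([1.0, 0.0, 0.0, 0.0]))
scores_b = steady_state(P, np.array([0.0, 0.0, 1.0, 0.0]))
print("function A scores:", np.round(scores_a, 3))
print("function B scores:", np.round(scores_b, 3))
```

In this toy setting, unlabelled protein 1 scores higher for function A than protein 3 does, and the reverse holds for function B, which mirrors the stated goal that proteins close in ensemble Hausdorff distance receive similar labels.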

MeSH Term

Algorithms
Animals
Computational Biology
Databases, Protein
Genome-Wide Association Study
Markov Chains
Proteins
Supervised Machine Learning

Chemicals

Proteins
