The probability of edge existence due to node degree: a baseline for network-based predictions.

Michael Zietz, Daniel S Himmelstein, Kyle Kloster, Christopher Williams, Michael W Nagle, Casey S Greene
Author Information
  1. Michael Zietz: Department of Physics & Astronomy, University of Pennsylvania, Philadelphia, Pennsylvania, United States of America; Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, Philadelphia, Pennsylvania, United States of America.
  2. Daniel S Himmelstein: Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, Philadelphia, Pennsylvania, United States of America.
  3. Kyle Kloster: Department of Computer Science, North Carolina State University, Raleigh, North Carolina, United States of America.
  4. Christopher Williams: Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, Philadelphia, Pennsylvania, United States of America.
  5. Michael W Nagle: Internal Medicine Research Unit, Pfizer Worldwide Research, Development, and Medical.
  6. Casey S Greene: Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, Philadelphia, Pennsylvania, United States of America.

Abstract

Important tasks in biomedical discovery such as predicting gene functions, gene-disease associations, and drug repurposing opportunities are often framed as network edge prediction. The number of edges connecting to a node, termed degree, can vary greatly across nodes in real biomedical networks, and the distribution of degrees varies between networks. If degree strongly influences edge prediction, then imbalance or bias in the distribution of degrees could lead to nonspecific or misleading predictions. We introduce a network permutation framework to quantify the effects of node degree on edge prediction. Our framework decomposes performance into the proportions attributable to degree and to the network's specific connections. We find that performance attributable to factors other than degree is often only a small portion of overall performance. Degree's predictive performance diminishes when the networks used for training and testing, despite measuring the same biological relationships, were generated using distinct techniques and hence have large differences in degree distribution. We introduce the permutation-derived edge prior as the probability that an edge exists based only on degree. The edge prior shows excellent discrimination and calibration for 20 biomedical networks (16 bipartite, 3 undirected, 1 directed), with AUROCs frequently exceeding 0.85. Researchers seeking to predict new or missing edges in biological networks should use the edge prior as a baseline to identify the fraction of performance that is nonspecific because of degree. We released our methods as an open-source Python package (https://github.com/hetio/xswap/).
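The idea of a permutation-derived edge prior can be illustrated with a minimal sketch. The code below is not the xswap package's API; the function names `permute_edges` and `edge_prior`, the parameters `n_perm` and `swap_multiplier`, and the toy gene-disease edge list are all hypothetical and chosen for illustration. It assumes a bipartite network stored as (source, target) pairs, randomizes it with degree-preserving edge swaps, and estimates the prior for each node pair as the fraction of permuted networks in which that pair is connected. This per-pair estimate is a simplification, and the published method may differ in its details.

```python
import random
from collections import Counter


def permute_edges(edges, n_swaps, rng):
    """Randomize a bipartite edge list while preserving every node's degree."""
    edges = list(edges)
    edge_set = set(edges)
    for _ in range(n_swaps):
        # Pick two distinct edges (a, b) and (c, d) and try to rewire them
        # into (a, d) and (c, b), which leaves all degrees unchanged.
        i, j = rng.sample(range(len(edges)), 2)
        (a, b), (c, d) = edges[i], edges[j]
        new_1, new_2 = (a, d), (c, b)
        # Skip swaps that would duplicate an existing edge.
        if new_1 in edge_set or new_2 in edge_set:
            continue
        edge_set -= {(a, b), (c, d)}
        edge_set |= {new_1, new_2}
        edges[i], edges[j] = new_1, new_2
    return edges


def edge_prior(edges, n_perm=200, swap_multiplier=10, seed=0):
    """Estimate, for each node pair, the fraction of permuted networks
    in which the pair is connected (an assumption-laden sketch)."""
    rng = random.Random(seed)
    counts = Counter()
    current = list(edges)
    for _ in range(n_perm):
        current = permute_edges(current, swap_multiplier * len(current), rng)
        counts.update(current)
    # Pairs that never appear in any permutation have an estimated prior of 0
    # and are simply absent from the returned dictionary.
    return {pair: counts[pair] / n_perm for pair in counts}


# Hypothetical toy network: three genes, three diseases.
toy_edges = [
    ("g1", "d1"), ("g1", "d2"), ("g1", "d3"),
    ("g2", "d1"),
    ("g3", "d2"),
]
prior = edge_prior(toy_edges)
```

In this toy network the high-degree node `g1` is connected to every target, so every degree-preserving permutation keeps its edges, while the low-degree pairs split their probability between the remaining options. Comparing a predictor's scores against such a degree-only baseline is the kind of check the abstract recommends.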

Grants

  1. R01 HG010067/NHGRI NIH HHS
