Detecting anomalous proteins using deep representations.

Tomer Michael-Pitschaze, Niv Cohen, Dan Ofer, Yedid Hoshen, Michal Linial
Author Information
  1. Tomer Michael-Pitschaze: The Rachel and Selim Benin School of Computer Science and Engineering, The Hebrew University of Jerusalem, Jerusalem, Israel.
  2. Niv Cohen: The Rachel and Selim Benin School of Computer Science and Engineering, The Hebrew University of Jerusalem, Jerusalem, Israel.
  3. Dan Ofer: Department of Biological Chemistry, Institute of Life Sciences, The Hebrew University of Jerusalem, Jerusalem, Israel.
  4. Yedid Hoshen: The Rachel and Selim Benin School of Computer Science and Engineering, The Hebrew University of Jerusalem, Jerusalem, Israel.
  5. Michal Linial: Department of Biological Chemistry, Institute of Life Sciences, The Hebrew University of Jerusalem, Jerusalem, Israel.

Abstract

Many advances in biomedicine can be attributed to identifying unusual proteins and genes. Many of these proteins' unique properties were discovered by manual inspection, which is becoming infeasible at the scale of modern protein datasets. Here, we propose to tackle this challenge using anomaly detection methods that automatically identify unexpected properties. We adopt a state-of-the-art anomaly detection paradigm from computer vision to highlight unusual proteins. We generate meaningful representations without labeled inputs, using pretrained deep neural network models. We apply these protein language models (pLMs) to detect anomalies in function, phylogenetic families, and segmentation tasks. We compute protein anomaly scores to highlight human prion-like proteins, distinguish viral proteins from their host proteome, and mark non-classical ion/metal binding proteins and enzymes. Other tasks concern segmentation of protein sequences into folded and unstructured regions. We provide candidates for rare functionality (e.g., prion proteins). Additionally, we show that the anomaly score is useful in 3D folding-related segmentation. Our novel method improves on strong baselines and performs well across a variety of tasks. We conclude that the combination of pLMs and anomaly detection techniques is a valid method for discovering a range of global and local protein characteristics.
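
As a rough illustration of the idea of scoring proteins by how far their representation lies from a reference set, the sketch below computes a k-nearest-neighbour distance score. This is an assumption made for illustration only: the abstract does not specify the scoring rule, and the function name `knn_anomaly_score`, the value of `k`, and the random 8-dimensional vectors standing in for pLM embeddings are all hypothetical.

```python
import math
import random


def knn_anomaly_score(query, normal_embeddings, k=3):
    """Mean Euclidean distance from `query` to its k nearest reference embeddings.

    A larger value means the query lies farther from the reference set,
    i.e. it looks more anomalous under this (assumed) scoring rule.
    """
    distances = sorted(math.dist(query, ref) for ref in normal_embeddings)
    nearest = distances[:k]
    return sum(nearest) / len(nearest)


# Toy stand-ins for pLM embeddings (hypothetical 8-dimensional vectors);
# in practice these would come from a pretrained protein language model.
rng = random.Random(0)
normal_embeddings = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(200)]

typical_protein = [0.0] * 8  # lies near the centre of the reference set
unusual_protein = [6.0] * 8  # lies far from every reference embedding

typical_score = knn_anomaly_score(typical_protein, normal_embeddings)
unusual_score = knn_anomaly_score(unusual_protein, normal_embeddings)
print(f"typical: {typical_score:.2f}, unusual: {unusual_score:.2f}")
```

The same scoring function could in principle be applied to per-residue embeddings to obtain a local score along a sequence, which is one plausible way to approach the segmentation tasks mentioned above.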
