Cross-modality and self-supervised protein embedding for compound-protein affinity and contact prediction.

Yuning You, Yang Shen
Author Information
  1. Yuning You: Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX 77843, USA.
  2. Yang Shen: Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX 77843, USA.

Abstract

MOTIVATION: Computational methods for compound-protein affinity and contact (CPAC) prediction aim to facilitate rational drug discovery by simultaneously predicting the strength and the pattern of compound-protein interactions. Although the desired outputs are highly structure-dependent, the lack of protein structures often makes structure-free methods rely on protein sequence inputs alone. The scarcity of compound-protein pairs with affinity and contact labels further limits the accuracy and the generalizability of CPAC models.
RESULTS: To overcome the aforementioned challenges of structure naivety and labeled-data scarcity, we introduce cross-modality and self-supervised learning, respectively, for structure-aware and task-relevant protein embedding. Specifically, protein data are available in two modalities, 1D amino-acid sequences and predicted 2D contact maps, which are separately embedded with recurrent and graph neural networks, respectively, as well as jointly embedded with two cross-modality schemes. Furthermore, both protein modalities are pre-trained under various self-supervised learning strategies by leveraging massive amounts of unlabeled protein data. Our results indicate that individual protein modalities differ in their strengths in predicting affinities or contacts. Proper cross-modality protein embedding combined with self-supervised learning improves model generalizability when predicting both affinities and contacts for unseen proteins.
AVAILABILITY AND IMPLEMENTATION: Data and source code are available at https://github.com/Shen-Lab/CPAC.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
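
The two-modality embedding described in RESULTS can be illustrated with a toy sketch. This is not the authors' implementation (see the repository linked above for that): the function names are hypothetical, the "recurrent" encoder is an unlearned running state, the "graph" encoder is a single round of neighbour averaging over a binary contact map, and per-residue concatenation is only an assumed example of a joint scheme, since the abstract does not specify the two cross-modality schemes.

```python
from typing import List

Vector = List[float]

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot(residue: str) -> Vector:
    """Toy per-residue feature: 20-dim one-hot (all zeros if unknown)."""
    return [1.0 if residue == aa else 0.0 for aa in AMINO_ACIDS]


def sequence_embedding(seq: str, decay: float = 0.5) -> List[Vector]:
    """Stand-in for a recurrent encoder (assumption, no learned weights):
    running state h_t = decay * h_{t-1} + x_t over the 1D sequence."""
    state = [0.0] * len(AMINO_ACIDS)
    out = []
    for residue in seq:
        x = one_hot(residue)
        state = [decay * h + xi for h, xi in zip(state, x)]
        out.append(state)
    return out


def graph_embedding(seq: str, contacts: List[List[int]]) -> List[Vector]:
    """Stand-in for a graph encoder (assumption): one round of mean
    aggregation over each residue and its contact-map neighbours."""
    feats = [one_hot(r) for r in seq]
    n = len(seq)
    out = []
    for i in range(n):
        nbrs = [i] + [j for j in range(n) if j != i and contacts[i][j]]
        out.append([sum(feats[j][k] for j in nbrs) / len(nbrs)
                    for k in range(len(AMINO_ACIDS))])
    return out


def joint_embedding(seq: str, contacts: List[List[int]]) -> List[Vector]:
    """Hypothetical joint scheme: concatenate both modalities per residue."""
    s = sequence_embedding(seq)
    g = graph_embedding(seq, contacts)
    return [si + gi for si, gi in zip(s, g)]


# Three residues in a chain: residue 1 contacts residues 0 and 2.
seq = "ACD"
contacts = [[0, 1, 0],
            [1, 0, 1],
            [0, 1, 0]]
emb = joint_embedding(seq, contacts)
print(len(emb), len(emb[0]))  # one 40-dim vector per residue
```

In a trained model, each stand-in would be replaced by a learned encoder, and the joint representation would feed the affinity and contact prediction heads.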

Grants

  1. R35 GM124952/NIGMS NIH HHS

MeSH Terms

Amino Acid Sequence
Drug Discovery
Neural Networks, Computer
Proteins
Software

Chemicals

Proteins
