Prototype-based contrastive substructure identification for molecular property prediction.

Gaoqi He, Shun Liu, Zhuoran Liu, Changbo Wang, Kai Zhang, Honglin Li
Author Information
  1. Gaoqi He: School of Computer Science and Technology, East China Normal University, 200062 Shanghai, China.
  2. Shun Liu: School of Computer Science and Technology, East China Normal University, 200062 Shanghai, China.
  3. Zhuoran Liu: School of Computer Science and Technology, East China Normal University, 200062 Shanghai, China.
  4. Changbo Wang: School of Computer Science and Technology, East China Normal University, 200062 Shanghai, China.
  5. Kai Zhang: School of Computer Science and Technology, East China Normal University, 200062 Shanghai, China.
  6. Honglin Li: Innovation Center for AI and Drug Discovery, East China Normal University, 200062 Shanghai, China.

Abstract

Substructure-based representation learning has emerged as a powerful approach to featurize complex attributed graphs, with promising results in molecular property prediction (MPP). However, existing MPP methods mainly rely on manually defined rules to extract substructures. It remains an open challenge to adaptively identify meaningful substructures from numerous molecular graphs to accommodate MPP tasks. To this end, this paper proposes Prototype-based cOntrastive Substructure IdentificaTion (POSIT), a self-supervised framework to autonomously discover substructural prototypes across graphs so as to guide end-to-end molecular fragmentation. During pre-training, POSIT emphasizes two key aspects of substructure identification: firstly, it imposes a soft connectivity constraint to encourage the generation of topologically meaningful substructures; secondly, it aligns resultant substructures with derived prototypes through a prototype-substructure contrastive clustering objective, ensuring attribute-based similarity within clusters. In the fine-tuning stage, a cross-scale attention mechanism is designed to integrate substructure-level information to enhance molecular representations. The effectiveness of the POSIT framework is demonstrated by experimental results from diverse real-world datasets, covering both classification and regression tasks. Moreover, visualization analysis validates the consistency of chemical priors with identified substructures. The source code is publicly available at https://github.com/VRPharmer/POSIT.
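
The prototype-substructure contrastive clustering objective mentioned in the abstract can be illustrated with a small sketch. This is not the authors' implementation (see the linked repository for that); it assumes a generic InfoNCE-style loss in which each substructure embedding is pulled toward its assigned prototype and pushed away from the others. The function name `prototype_contrastive_loss`, the `temperature` parameter, and the toy vectors are all hypothetical choices made for illustration.

```python
import math

def _normalize(v):
    # Scale a vector to unit length so dot products become cosine similarities.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def prototype_contrastive_loss(embeddings, prototypes, assignments, temperature=0.1):
    """Assumed InfoNCE-style loss: mean negative log-probability that each
    embedding picks its assigned prototype under a softmax over similarities."""
    protos = [_normalize(p) for p in prototypes]
    total = 0.0
    for z, k in zip(embeddings, assignments):
        z = _normalize(z)
        logits = [sum(a * b for a, b in zip(z, p)) / temperature for p in protos]
        m = max(logits)  # subtract the max for numerical stability
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[k]
    return total / len(embeddings)

# Toy data (hypothetical): three prototypes and three substructure embeddings.
prototypes = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
embeddings = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.2, 0.9]]

loss_aligned = prototype_contrastive_loss(embeddings, prototypes, [0, 1, 2])
loss_shuffled = prototype_contrastive_loss(embeddings, prototypes, [1, 2, 0])
print(loss_aligned < loss_shuffled)
```

Under these assumptions, embeddings assigned to their nearest prototype yield a much lower loss than the same embeddings with shuffled assignments, which is the behaviour a clustering objective of this kind is meant to reward.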

MeSH Terms

Algorithms
Computational Biology
Software
Cluster Analysis
Molecular Structure
