A systematic review of deep learning chemical language models in recent era.

Hector Flores-Hernandez, Emmanuel Martinez-Ledesma
Author Information
  1. Hector Flores-Hernandez: Tecnológico de Monterrey, School of Engineering and Sciences, Monterrey, 64710, Nuevo León, México.
  2. Emmanuel Martinez-Ledesma: Tecnológico de Monterrey, School of Medicine and Health Sciences, Monterrey, 64710, Nuevo León, México. juanemmanuel@tec.mx.

Abstract

Discovering new chemical compounds with specific properties can benefit fields that depend on materials for their development, but the task is costly in both complexity and resources. Since the beginning of the data age, deep learning techniques have revolutionized molecular design by analyzing and learning from representations of molecular data, greatly reducing the resources and time involved. Various deep learning approaches, built on a variety of architectures and strategies, have been developed to explore the extensive and discontinuous chemical space and to generate compounds with specific properties. In this study, we present a systematic review that offers a statistical description and comparison of the strategies used to generate molecules through deep learning techniques, using the metrics proposed in Molecular Sets (MOSES) or GuacaMol. The study included 48 articles retrieved from a query-based search of Scopus and Web of Science and 25 articles retrieved from citation search, yielding a total of 72 retrieved articles, of which 62 correspond to chemical language model approaches to molecule generation and the other 10 correspond to molecular graph representations. Transformers, recurrent neural networks (RNNs), generative adversarial networks (GANs), Structured State Space Sequence (S4) models, and variational autoencoders (VAEs) are the main deep learning architectures used for molecule generation in the set of retrieved articles. In addition, transfer learning, reinforcement learning, and conditional learning are the techniques most commonly employed for biased model generation and exploration of specific regions of chemical space. Finally, this analysis focuses on the central themes of molecular representation, databases, training dataset size, validity-novelty trade-off, and performance of unbiased and biased chemical language models. These themes were selected for a statistical analysis based on graphical representation and statistical tests. The resulting analysis reveals the main challenges, advantages, and opportunities in the field of chemical language models over the past four years.
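
As an illustration of the kind of metrics the abstract refers to, the following is a minimal sketch of validity, uniqueness, and novelty computed over a batch of generated molecule strings. It is not the MOSES or GuacaMol implementation: the function names, the toy validity check (balanced parentheses only), and the example strings are assumptions made for this sketch. A real pipeline would parse and canonicalize SMILES with a cheminformatics toolkit before comparing them.

```python
from typing import Callable, Iterable


def generation_metrics(
    generated: Iterable[str],
    training: Iterable[str],
    is_valid: Callable[[str], bool],
) -> dict[str, float]:
    """Fractions of valid, unique, and novel strings (hypothetical helper)."""
    generated = list(generated)
    # Validity: share of generated strings accepted by the validity check.
    valid = [s for s in generated if is_valid(s)]
    # Uniqueness: share of valid strings that are distinct.
    unique = set(valid)
    # Novelty: share of distinct valid strings absent from the training set.
    novel = unique - set(training)
    return {
        "validity": len(valid) / len(generated) if generated else 0.0,
        "uniqueness": len(unique) / len(valid) if valid else 0.0,
        "novelty": len(novel) / len(unique) if unique else 0.0,
    }


def toy_is_valid(smiles: str) -> bool:
    """Toy stand-in for a chemistry parser: non-empty, balanced parentheses."""
    depth = 0
    for ch in smiles:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return bool(smiles) and depth == 0


# Illustrative inputs only; not data from the reviewed articles.
metrics = generation_metrics(
    generated=["CCO", "CCO", "CC(=O)O", "C(C", "c1ccccc1"],
    training=["CCO"],
    is_valid=toy_is_valid,
)
print(metrics)
```

Under these definitions a model that copies its training set can score high on validity and low on novelty, which is one way to read the validity-novelty trade-off named among the review's themes.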
