The promises of large language models for protein design and modeling.

Giorgio Valentini, Dario Malchiodi, Jessica Gliozzo, Marco Mesiti, Mauricio Soto-Gomez, Alberto Cabri, Justin Reese, Elena Casiraghi, Peter N Robinson
Author Information
  1. Giorgio Valentini: AnacletoLab, Dipartimento di Informatica, Università degli Studi di Milano, Milan, Italy.
  2. Dario Malchiodi: AnacletoLab, Dipartimento di Informatica, Università degli Studi di Milano, Milan, Italy.
  3. Jessica Gliozzo: AnacletoLab, Dipartimento di Informatica, Università degli Studi di Milano, Milan, Italy.
  4. Marco Mesiti: AnacletoLab, Dipartimento di Informatica, Università degli Studi di Milano, Milan, Italy.
  5. Mauricio Soto-Gomez: AnacletoLab, Dipartimento di Informatica, Università degli Studi di Milano, Milan, Italy.
  6. Alberto Cabri: AnacletoLab, Dipartimento di Informatica, Università degli Studi di Milano, Milan, Italy.
  7. Justin Reese: Environmental Genomics and Systems Biology Division, Lawrence Berkeley National Laboratory, Berkeley, CA, United States.
  8. Elena Casiraghi: AnacletoLab, Dipartimento di Informatica, Università degli Studi di Milano, Milan, Italy.
  9. Peter N Robinson: Jackson Lab for Genomic Medicine, Farmington, CT, United States.

Abstract

The recent breakthroughs of Large Language Models (LLMs) in natural language processing have opened the way to significant advances in protein research. Indeed, the relationships between human natural language and the "language of proteins" invite the application and adaptation of LLMs to protein modeling and design. Considering the impressive results of GPT-4 and other recently developed LLMs in processing, generating and translating human languages, we anticipate analogous results with the language of proteins. Protein language models have already been trained to accurately predict protein properties and to generate novel, functionally characterized proteins, achieving state-of-the-art results. In this paper we discuss the promises and the open challenges raised by this novel and exciting research area, and we propose our perspective on how LLMs will affect protein modeling and design.
