Cancer gene identification through integrating causal prompting large language model with omics data-driven causal inference.

Haolong Zeng, Chaoyi Yin, Chunyang Chai, Yuezhu Wang, Qi Dai, Huiyan Sun
Author Information
  1. Haolong Zeng: School of Artificial Intelligence, Jilin University, 3003 Qianjin Street, Changchun 130012, Jilin Province, China.
  2. Chaoyi Yin: School of Artificial Intelligence, Jilin University, 3003 Qianjin Street, Changchun 130012, Jilin Province, China.
  3. Chunyang Chai: School of Artificial Intelligence, Jilin University, 3003 Qianjin Street, Changchun 130012, Jilin Province, China.
  4. Yuezhu Wang: School of Artificial Intelligence, Jilin University, 3003 Qianjin Street, Changchun 130012, Jilin Province, China.
  5. Qi Dai: College of Life Science and Medicine, Zhejiang Sci-Tech University, Second Street 928, Qiantang District, Hangzhou 310018, Zhejiang Province, China.
  6. Huiyan Sun: School of Artificial Intelligence, Jilin University, 3003 Qianjin Street, Changchun 130012, Jilin Province, China.

Abstract

Identifying genes causally linked to cancer from a multi-omics perspective is essential for understanding the mechanisms of cancer and improving therapeutic strategies. Traditional statistical and machine-learning methods that identify cancer genes through generalized correlation often produce redundant, biased predictions with limited interpretability, largely because they overlook confounding factors, selection biases, and the nonlinear activation functions in neural networks. In this study, we introduce a novel framework for identifying cancer genes across multiple omics domains, named ICGI (Integrative Causal Gene Identification), which combines a large language model (LLM) guided by causality-oriented contextual cues and prompts with data-driven causal feature selection. This approach demonstrates the effectiveness and potential of LLMs in uncovering cancer genes and comprehending disease mechanisms, particularly at the genomic level. However, our findings also highlight that current LLMs may not capture comprehensive information across all omics levels. Applied to transcriptomic datasets from six cancer types in The Cancer Genome Atlas and compared with state-of-the-art methods, the proposed causal feature selection module shows superior capability in identifying cancer genes that distinguish cancerous from normal samples. Additionally, we have developed an online service platform that allows users to input a gene of interest and a specific cancer type. The platform automatically reports whether the gene plays a significant role in that cancer, along with clear and accessible explanations, and summarizes the inference outcomes obtained from data-driven causal learning methods.

Grants

  1. 451240122094/Graduate Innovation Fund of Jilin University
  2. 2024JBGS06/Bethune Medical College of Jilin University
  3. 45123031J004/Jilin University
  4. 2021R52019/High level Talents in Zhejiang Province
  5. 20240101025JJ/Natural Science Foundation of Jilin Province
  6. 62372210/National Natural Science Foundation of China

MeSH Terms

Humans
Neoplasms
Genomics
Computational Biology
Algorithms
Genes, Neoplasm
Machine Learning
