Multiomics-Based Feature Extraction and Selection for the Prediction of Lung Cancer Survival.

Roman Jaksik, Kamila Szumała, Khanh Ngoc Dinh, Jarosław Śmieja
Author Information
  1. Roman Jaksik: Department of Systems Biology and Engineering, Silesian University of Technology, 44-100 Gliwice, Poland.
  2. Kamila Szumała: Faculty of Automatic Control, Electronics and Computer Science, Silesian University of Technology, 44-100 Gliwice, Poland.
  3. Khanh Ngoc Dinh: Irving Institute for Cancer Dynamics and Department of Statistics, Columbia University, New York, NY 10027, USA.
  4. Jarosław Śmieja: Department of Systems Biology and Engineering, Silesian University of Technology, 44-100 Gliwice, Poland.

Abstract

Lung cancer is a global health challenge, and its management is hindered by delayed diagnosis and the disease's complex molecular landscape. Accurate patient survival prediction is critical, motivating the exploration of various -omics datasets using machine learning methods. Leveraging multi-omics data, this study seeks to enhance the accuracy of survival prediction by proposing new feature extraction techniques combined with unbiased feature selection. Two lung adenocarcinoma multi-omics datasets, originating from the TCGA and CPTAC-3 projects, were employed for this purpose, emphasizing gene expression, methylation, and mutations as the most relevant data sources that provide features for the survival prediction models. Additionally, gene set aggregation was shown to be the most effective feature extraction method for mutation and copy number variation data. Using the TCGA dataset, we identified 32 molecular features that allowed the construction of a 2-year survival prediction model with an AUC of 0.839. The selected features were additionally tested on an independent CPTAC-3 dataset, achieving an AUC of 0.815 in nested cross-validation, which confirmed the robustness of the identified features.
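
The abstract does not specify how gene set aggregation is computed, so the following is only a minimal sketch of the general idea: sparse per-gene mutation calls are collapsed into one feature per gene set. It assumes the feature is the fraction of genes in each set that are mutated in a patient; the function name `aggregate_by_gene_set`, the gene set names, and the toy patients are all hypothetical and are not taken from the study.

```python
from typing import Dict, Set


def aggregate_by_gene_set(
    mutations: Dict[str, Set[str]],
    gene_sets: Dict[str, Set[str]],
) -> Dict[str, Dict[str, float]]:
    """Return, per patient, the fraction of genes in each set that are mutated.

    Assumption: the aggregate is a simple fraction; the study may use a
    different statistic (e.g. counts or weighted scores).
    """
    features: Dict[str, Dict[str, float]] = {}
    for patient, mutated in mutations.items():
        features[patient] = {
            # Empty gene sets yield 0.0 instead of dividing by zero.
            name: (len(mutated & genes) / len(genes)) if genes else 0.0
            for name, genes in gene_sets.items()
        }
    return features


# Hypothetical gene sets and mutation calls, for illustration only.
gene_sets = {
    "RTK_RAS": {"KRAS", "EGFR", "BRAF", "NF1"},
    "P53_PATHWAY": {"TP53", "MDM2"},
}
mutations = {
    "patient_A": {"KRAS", "TP53", "STK11"},
    "patient_B": {"EGFR", "BRAF"},
    "patient_C": set(),
}

features = aggregate_by_gene_set(mutations, gene_sets)
for patient, row in features.items():
    print(patient, row)
```

Aggregating this way replaces thousands of mostly-zero gene-level columns with a handful of denser set-level features, which is one plausible reason such features could suit mutation and copy number variation data.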

Grants

  1. 2021/41/B/NZ2/04134/National Science Centre

MeSH Term

Humans
Lung Neoplasms
Multiomics
DNA Copy Number Variations
Adenocarcinoma of Lung
Research Design
