A Hybrid Machine Learning Approach to Screen Optimal Predictors for the Classification of Primary Breast Tumors from Gene Expression Microarray Data.

Nashwan Alromema, Asif Hassan Syed, Tabrej Khan
Author Information
  1. Nashwan Alromema: Department of Computer Science, Faculty of Computing and Information Technology Rabigh (FCITR), King Abdulaziz University, Jeddah 22254, Saudi Arabia.
  2. Asif Hassan Syed: Department of Computer Science, Faculty of Computing and Information Technology Rabigh (FCITR), King Abdulaziz University, Jeddah 22254, Saudi Arabia.
  3. Tabrej Khan: Department of Information Systems, Faculty of Computing and Information Technology Rabigh (FCITR), King Abdulaziz University, Jeddah 22254, Saudi Arabia.

Abstract

The high dimensionality and sparsity of microarray gene expression data make it challenging to analyze the data and screen an optimal subset of genes as predictors of breast cancer (BC). This study proposes a novel hybrid sequential Feature Selection (FS) framework that combines minimum Redundancy-Maximum Relevance (mRMR), a two-tailed unpaired t-test, and meta-heuristics to screen an optimal set of gene biomarkers as predictors of BC. The proposed framework identified a set of three optimal gene biomarkers: MAPK1, APOBEC3B, and ENAH. In addition, state-of-the-art supervised Machine Learning (ML) algorithms, namely Support Vector Machine (SVM), K-Nearest Neighbors (KNN), Neural Net (NN), Naïve Bayes (NB), Decision Tree (DT), eXtreme Gradient Boosting (XGBoost), and Logistic Regression (LR), were used to test the predictive capability of the selected gene biomarkers and to select the most effective breast cancer diagnostic model, as judged by performance metrics. The XGBoost-based model performed best, with an accuracy of 0.976 ± 0.027, an F1-Score of 0.974 ± 0.030, and an AUC of 0.961 ± 0.035 on an independent test dataset. The classification system built on the screened gene biomarkers efficiently distinguishes primary breast tumors from normal breast samples.
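
To make the t-test stage of the framework concrete, the following is a minimal sketch of how genes could be ranked by a two-tailed unpaired t statistic between tumor and normal samples. It is not the authors' implementation: the abstract does not say whether equal variances were assumed, so the Welch form is used here as an assumption, and the gene names, data, and the helper names `welch_t` and `rank_genes_by_t` are hypothetical. The mRMR and meta-heuristic stages are not shown.

```python
import math
import random
import statistics


def welch_t(a, b):
    """Unpaired (Welch) t statistic for two independent samples."""
    var_a = statistics.variance(a)  # sample variance, needs len >= 2
    var_b = statistics.variance(b)
    se = math.sqrt(var_a / len(a) + var_b / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / se


def rank_genes_by_t(expression, labels):
    """Rank genes by |t| between tumor (1) and normal (0) samples.

    expression: dict mapping gene name -> list of values, one per sample.
    labels: list of 0/1 class labels aligned with the sample order.
    """
    scores = {}
    for gene, values in expression.items():
        tumor = [v for v, y in zip(values, labels) if y == 1]
        normal = [v for v, y in zip(values, labels) if y == 0]
        # Two-tailed: the direction of change does not matter, so use |t|.
        scores[gene] = abs(welch_t(tumor, normal))
    return sorted(scores, key=scores.get, reverse=True)


# Synthetic toy data (NOT from the study): 20 normal and 20 tumor samples.
rng = random.Random(0)
labels = [0] * 20 + [1] * 20
expression = {
    # Hypothetical gene whose mean shifts strongly in tumor samples.
    "GENE_SHIFTED": [rng.gauss(0.0, 1.0) + (5.0 if y else 0.0) for y in labels],
    # Hypothetical genes with no class-dependent shift.
    "GENE_NOISE_1": [rng.gauss(0.0, 1.0) for _ in labels],
    "GENE_NOISE_2": [rng.gauss(0.0, 1.0) for _ in labels],
}

ranking = rank_genes_by_t(expression, labels)
print(ranking)
```

In a sequential framework such as the one described, a filter of this kind would typically keep only the top-ranked genes (or those below a p-value threshold) and pass them to the next stage. Converting the statistic to a p-value needs the t distribution, which a library such as SciPy provides.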

Keywords

Grants

  1. G:190-830-1442/Deanship of Scientific Research (DSR), King Abdulaziz University, Jeddah
