Comparative analysis of feature selection techniques for COVID-19 dataset.

Farideh Mohtasham, MohamadAmin Pourhoseingholi, Seyed Saeed Hashemi Nazari, Kaveh Kavousi, Mohammad Reza Zali
Author Information
  1. Farideh Mohtasham: Gastroenterology and Liver Diseases Research Center, Research Institute for Gastroenterology and Liver Diseases, Shahid Beheshti University of Medical Sciences, Tehran, Iran. f-mohtasham@sbmu.ac.ir.
  2. MohamadAmin Pourhoseingholi: Hearing Sciences, Mental Health and Clinical Neurosciences, School of Medicine, National Institute for Health and Care Research (NIHR) Nottingham Biomedical Research Center, University of Nottingham, Nottingham, UK.
  3. Seyed Saeed Hashemi Nazari: Department of Epidemiology, School of Public Health & Safety, Shahid Beheshti University of Medical Sciences (SBMU), Tehran, Iran.
  4. Kaveh Kavousi: Laboratory of Complex Biological Systems and Bioinformatics (CBB), Department of Bioinformatics, Institute of Biochemistry and Biophysics (IBB), University of Tehran, Tehran, Iran. kkavousi@ut.ac.ir.
  5. Mohammad Reza Zali: Gastroenterology and Liver Diseases Research Center, Research Institute for Gastroenterology and Liver Diseases, Shahid Beheshti University of Medical Sciences, Tehran, Iran.

Abstract

In the context of early disease detection, machine learning (ML) has emerged as a vital tool. Feature selection (FS) algorithms play a crucial role in ensuring the accuracy of predictive models by identifying the most influential variables. This study, focusing on a retrospective cohort of 4778 COVID-19 patients from Iran, explores the performance of various FS methods, including filter, embedded, and hybrid approaches, in predicting mortality outcomes. The researchers leveraged 115 routine clinical, laboratory, and demographic features and employed 13 ML models to assess the effectiveness of these FS methods based on classification accuracy, predictive accuracy, and statistical tests. The results indicate that a Hybrid Boruta-VI model combined with the Random Forest algorithm demonstrated superior performance, achieving an accuracy of 0.89, an F1 score of 0.76, and an AUC value of 0.95 on test data. Key variables identified as important predictors of adverse outcomes include age, oxygen saturation levels, albumin levels, neutrophil counts, platelet levels, and markers of kidney function. These findings highlight the potential of advanced FS techniques and ML models in enhancing early disease detection and informing clinical decision-making.
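The abstract reports three evaluation metrics on test data: accuracy, F1 score, and AUC. The following is a minimal sketch of how these metrics are commonly computed for a binary outcome such as mortality. It is not the authors' code. The function names, the toy labels and scores, and the 0.5 decision threshold are all assumptions made for illustration.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of cases whose predicted label equals the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Harmonic mean of precision and recall for the positive class (1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive case receives
    a higher score than a randomly chosen negative case (ties count 0.5).
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = died, 0 = survived (not taken from the study).
y_true = [1, 0, 1, 0, 0, 1, 0, 0]
scores = [0.9, 0.2, 0.4, 0.6, 0.1, 0.8, 0.3, 0.05]  # model-predicted risk
y_pred = [1 if s >= 0.5 else 0 for s in scores]     # assumed 0.5 threshold

acc_value = accuracy(y_true, y_pred)
f1_value = f1_score(y_true, y_pred)
auc_value = auc(y_true, scores)
```

Accuracy and F1 depend on the chosen threshold, whereas AUC is computed from the raw scores. This is one reason a model can show a high AUC alongside a noticeably lower F1 score when the outcome classes are imbalanced.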

MeSH Terms

Humans
COVID-19
Machine Learning
Retrospective Studies
Male
Female
Middle Aged
SARS-CoV-2
Algorithms
Iran
Aged
Adult
