Extracting relevant predictive variables for COVID-19 severity prognosis: An exhaustive comparison of feature selection techniques.

Miren Hayet-Otero, Fernando García-García, Dae-Jin Lee, Joaquín Martínez-Minaya, Pedro Pablo España Yandiola, Isabel Urrutia Landa, Mónica Nieves Ermecheo, José María Quintana, Rosario Menéndez, Antoni Torres, Rafael Zalacain Jorge, Inmaculada Arostegui, with the COVID-19 & Air Pollution Working Group
Author Information
  1. Miren Hayet-Otero: Basque Center for Applied Mathematics (BCAM), Bilbao, Basque Country, Spain.
  2. Fernando García-García: Basque Center for Applied Mathematics (BCAM), Bilbao, Basque Country, Spain.
  3. Dae-Jin Lee: Basque Center for Applied Mathematics (BCAM), Bilbao, Basque Country, Spain.
  4. Joaquín Martínez-Minaya: Department of Applied Statistics and Operational Research, and Quality, Universitat Politècnica de València (UPV), Valencia, Valencian Community, Spain.
  5. Pedro Pablo España Yandiola: Respiratory Service, Galdakao-Usansolo University Hospital, Galdakao, Basque Country, Spain.
  6. Isabel Urrutia Landa: BioCruces Bizkaia Health Research Institute, Barakaldo, Basque Country, Spain.
  7. Mónica Nieves Ermecheo: BioCruces Bizkaia Health Research Institute, Barakaldo, Basque Country, Spain.
  8. José María Quintana: Research Unit, Galdakao-Usansolo University Hospital, Galdakao, Basque Country, Spain.
  9. Rosario Menéndez: Pneumology Department, La Fe University and Polytechnic Hospital, Valencia, Valencian Community, Spain.
  10. Antoni Torres: Pneumology Department, Hospital Clínic of Barcelona, Barcelona, Catalonia, Spain.
  11. Rafael Zalacain Jorge: Pneumology Service, Cruces University Hospital, Barakaldo, Basque Country, Spain.
  12. Inmaculada Arostegui: Basque Center for Applied Mathematics (BCAM), Bilbao, Basque Country, Spain.

Abstract

With the COVID-19 pandemic having caused unprecedented numbers of infections and deaths, large research efforts have been undertaken to increase our understanding of the disease and of the factors that determine its diverse clinical evolutions. Here we focused on a fully data-driven exploration of which factors (clinical or otherwise) were most informative for predicting SARS-CoV-2 pneumonia severity via machine learning (ML). In particular, feature selection (FS) techniques, designed to reduce the dimensionality of data, allowed us to characterize which of our variables were the most useful for ML prognosis. We conducted a multi-centre clinical study, enrolling n = 1548 patients hospitalized due to SARS-CoV-2 pneumonia, of whom 792, 238, and 518 experienced low, medium and high-severity evolutions, respectively. Up to 106 patient-specific clinical variables were collected at admission, although 14 of them had to be discarded for containing ⩾60% missing values. Together with 7 socioeconomic attributes and 32 exposures to air pollution (chronic and acute), these became d = 148 features after variable encoding. We addressed this ordinal classification problem both as an ML classification task and as a regression task. Two imputation techniques for missing data were explored, along with a total of 166 unique FS algorithm configurations: 46 filters, 100 wrappers and 20 embedded methods. Of these, 21 setups achieved satisfactory bootstrap stability (⩾0.70) with reasonable computation times: 16 filters, 2 wrappers, and 3 embedded methods. The subsets of features selected by the different techniques showed only modest Jaccard similarities to one another. However, they consistently pointed to the importance of certain explanatory variables, namely: the patient's C-reactive protein (CRP), pneumonia severity index (PSI), respiratory rate (RR) and oxygen levels (saturation SpO2, and the quotients SpO2/RR and arterial SatO2/FiO2), the neutrophil-to-lymphocyte ratio (NLR) and, to a certain extent, neutrophil and lymphocyte counts separately, as well as lactate dehydrogenase (LDH) and procalcitonin (PCT) levels in blood. A remarkable agreement was found a posteriori between our strategy and independent clinical research works investigating risk factors for COVID-19 severity. Hence, these findings stress the suitability of this type of fully data-driven approach for knowledge extraction, as a complement to clinical perspectives.
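To make the two comparison criteria mentioned in the abstract more concrete, the following is a minimal sketch of how Jaccard similarity between feature subsets and a bootstrap-based stability score could be computed. It is not the authors' implementation: the correlation-based `top_k_filter`, the use of mean pairwise Jaccard similarity as the stability score, the synthetic data from `make_toy_data`, and all parameter values are assumptions made purely for illustration, and the abstract does not state which stability measure was actually used.

```python
import random
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two feature subsets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention: two empty subsets count as identical
    return len(a & b) / len(a | b)


def abs_pearson(x, y):
    """Absolute Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column carries no information
    return abs(sxy) / (sxx * syy) ** 0.5


def top_k_filter(X, y, k):
    """Hypothetical filter (assumption): keep the k columns most correlated with y."""
    d = len(X[0])
    scores = [abs_pearson([row[j] for row in X], y) for j in range(d)]
    ranked = sorted(range(d), key=lambda j: scores[j], reverse=True)
    return set(ranked[:k])


def bootstrap_stability(X, y, k, n_boot=30, seed=0):
    """Mean pairwise Jaccard similarity of subsets selected on bootstrap resamples."""
    rng = random.Random(seed)
    n = len(X)
    subsets = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # sample rows with replacement
        subsets.append(top_k_filter([X[i] for i in idx], [y[i] for i in idx], k))
    pairs = list(combinations(subsets, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


def make_toy_data(n=200, d=10, seed=1):
    """Synthetic data (assumption): ordinal target 0/1/2 driven by features 0 and 1 only."""
    rng = random.Random(seed)
    X = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]
    y = []
    for row in X:
        s = 2.0 * row[0] + 1.5 * row[1] + rng.gauss(0, 0.5)
        y.append(0 if s < -1 else (1 if s < 1 else 2))
    return X, y


X, y = make_toy_data()
selected = top_k_filter(X, y, k=2)
stability = bootstrap_stability(X, y, k=2)
print("selected features:", sorted(selected))
print("bootstrap stability:", round(stability, 3))
```

A stability score close to 1 means the filter picks nearly the same features on every resample, whereas `jaccard` applied to the outputs of two different techniques measures how much their selections overlap. In this sketch the two notions share one helper, which is a simplification chosen for brevity.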

MeSH Term

Humans
COVID-19
SARS-CoV-2
Pandemics
Pneumonia
Prognosis
Retrospective Studies
