Analyzing the effect of data preprocessing techniques using machine learning algorithms on the diagnosis of COVID-19.

Gizemnur Erol, Betül Uzbaş, Cüneyt Yücelbaş, Şule Yücelbaş
Author Information
  1. Gizemnur Erol: Konya Technical University, Software Engineering Department, Konya, Turkey.
  2. Betül Uzbaş: Konya Technical University, Computer Engineering Department, Konya, Turkey.
  3. Cüneyt Yücelbaş: Tarsus University, Electronics and Automation Department, Mersin-Tarsus OIZ Vocational School of Technical Sciences, Mersin, Turkey.
  4. Şule Yücelbaş: Tarsus University, Computer Engineering Department, Mersin, Turkey.

Abstract

Real-time polymerase chain reaction (RT-PCR), commonly known as the swab test, is a laboratory test that diagnoses COVID-19 from respiratory samples. Because the coronavirus spread rapidly around the world, RT-PCR testing alone could not deliver results quickly enough. This created a need for complementary diagnostic methods, and machine learning studies in this area began. Working with medical data is challenging, however, because such data are often inconsistent, incomplete, difficult to scale, and very large. In addition, poor clinical decisions, irrelevant parameters, and limited medical data adversely affect the accuracy of these studies. Datasets containing COVID-19 blood parameters are currently scarcer than other medical datasets, so this study aims to improve the existing ones. To obtain more consistent results in COVID-19 machine learning studies, it investigates the effect of data preprocessing techniques on the classification of COVID-19 data. First, categorical feature encoding and feature scaling were applied to a dataset with 15 features containing blood data of 279 patients, including gender and age information. Missing values were then imputed using both the K-nearest neighbor (KNN) algorithm and multivariate imputation by chained equations (MICE). The classes were balanced with the synthetic minority oversampling technique (SMOTE). The effect of these preprocessing techniques was analyzed on the ensemble learning algorithms bagging, AdaBoost, and random forest, and on the popular classifiers KNN, support vector machine, logistic regression, artificial neural network, and decision tree.
With SMOTE applied, the highest accuracies obtained with the bagging classifier were 83.42% with KNN imputation and 83.74% with MICE imputation. Without SMOTE, the highest accuracy reached with the same classifier was 83.91%, obtained with KNN imputation. In conclusion, selected data preprocessing techniques are compared, their effect on classification success is presented, and the experiments demonstrate the importance of choosing the right combination of preprocessing steps.
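The two preprocessing ideas named in the abstract, KNN-based imputation and SMOTE oversampling, can be illustrated with a minimal standard-library sketch. This is not the authors' implementation: the function names `knn_impute` and `smote`, the parameter defaults, and the toy values are all assumptions made for illustration, and both routines are simplified variants of the techniques. In practice, library implementations such as scikit-learn's `KNNImputer` and imbalanced-learn's `SMOTE` would normally be used instead.

```python
import math
import random


def knn_impute(rows, k=2):
    """Fill None entries with the mean of that feature over the k nearest rows.

    Simplified sketch: distance is the root mean squared difference over
    the features that are observed in both rows.
    """
    def distance(a, b):
        shared = [(x - y) ** 2 for x, y in zip(a, b)
                  if x is not None and y is not None]
        return math.sqrt(sum(shared) / len(shared)) if shared else math.inf

    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Donors are the other rows in which feature j is observed.
            donors = [(distance(row, other), other[j])
                      for m, other in enumerate(rows)
                      if m != i and other[j] is not None]
            donors.sort(key=lambda d: d[0])
            nearest = [v for _, v in donors[:k]]
            if nearest:  # leave the gap if no donor exists
                filled[i][j] = sum(nearest) / len(nearest)
    return filled


def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic minority samples by interpolating between a
    randomly chosen sample and one of its k nearest minority neighbours.

    Assumes complete (already imputed) rows and at least two samples.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted((p for p in minority if p is not base),
                            key=lambda p: math.dist(base, p))[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


# Toy values invented for illustration; they are not from the study's dataset.
rows = [
    [1.0, 2.0],
    [1.1, None],   # missing second feature
    [5.0, 6.0],
    [1.2, 2.2],
]
filled = knn_impute(rows, k=2)

minority = [[1.0, 1.0], [2.0, 2.0], [1.0, 2.0]]
synthetic = smote(minority, n_new=4, k=2, seed=0)
```

In the toy data, the row with the missing value is closest to the first and last rows, so the gap is filled with the mean of their second feature. Each synthetic SMOTE sample lies on a line segment between two existing minority samples, which is why oversampling of this kind is applied after imputation, once every row is complete.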
