Harmonization of supervised machine learning practices for efficient source attribution of Listeria monocytogenes based on genomic data.

Pierluigi Castelli, Andrea De Ruvo, Andrea Bucciacchio, Nicola D'Alterio, Cesare Cammà, Adriano Di Pasquale, Nicolas Radomski
Author Information
  1. Pierluigi Castelli: Istituto Zooprofilattico Sperimentale dell'Abruzzo e del Molise "Giuseppe Caporale" (IZSAM), National Reference Centre (NRC) for Whole Genome Sequencing of microbial pathogens: data base and bioinformatics analysis (GENPAT), Via Campo Boario, Teramo, TE, 64100, Italy.
  2. Andrea De Ruvo: Istituto Zooprofilattico Sperimentale dell'Abruzzo e del Molise "Giuseppe Caporale" (IZSAM), National Reference Centre (NRC) for Whole Genome Sequencing of microbial pathogens: data base and bioinformatics analysis (GENPAT), Via Campo Boario, Teramo, TE, 64100, Italy.
  3. Andrea Bucciacchio: Istituto Zooprofilattico Sperimentale dell'Abruzzo e del Molise "Giuseppe Caporale" (IZSAM), National Reference Centre (NRC) for Whole Genome Sequencing of microbial pathogens: data base and bioinformatics analysis (GENPAT), Via Campo Boario, Teramo, TE, 64100, Italy.
  4. Nicola D'Alterio: Istituto Zooprofilattico Sperimentale dell'Abruzzo e del Molise "Giuseppe Caporale" (IZSAM), National Reference Centre (NRC) for Whole Genome Sequencing of microbial pathogens: data base and bioinformatics analysis (GENPAT), Via Campo Boario, Teramo, TE, 64100, Italy.
  5. Cesare Cammà: Istituto Zooprofilattico Sperimentale dell'Abruzzo e del Molise "Giuseppe Caporale" (IZSAM), National Reference Centre (NRC) for Whole Genome Sequencing of microbial pathogens: data base and bioinformatics analysis (GENPAT), Via Campo Boario, Teramo, TE, 64100, Italy.
  6. Adriano Di Pasquale: Istituto Zooprofilattico Sperimentale dell'Abruzzo e del Molise "Giuseppe Caporale" (IZSAM), National Reference Centre (NRC) for Whole Genome Sequencing of microbial pathogens: data base and bioinformatics analysis (GENPAT), Via Campo Boario, Teramo, TE, 64100, Italy.
  7. Nicolas Radomski: Istituto Zooprofilattico Sperimentale dell'Abruzzo e del Molise "Giuseppe Caporale" (IZSAM), National Reference Centre (NRC) for Whole Genome Sequencing of microbial pathogens: data base and bioinformatics analysis (GENPAT), Via Campo Boario, Teramo, TE, 64100, Italy. n.radomski@izs.it.

Abstract

BACKGROUND: Machine learning tools based on genomic data are promising for real-time surveillance activities that perform source attribution of foodborne bacteria such as Listeria monocytogenes. Given the heterogeneity of machine learning practices, our aim was to identify those influencing the source prediction performance of the usual holdout method combined with the repeated k-fold cross-validation method.
METHODS: A large collection of 1 100 L. monocytogenes genomes with known sources was built according to several genomic metrics to ensure authenticity and completeness of genomic profiles. Based on these genomic profiles (i.e. 7-locus alleles, core alleles, accessory genes, core SNPs and pan kmers), we developed a versatile workflow assessing the prediction performance of different combinations of training dataset splitting (i.e. 50, 60, 70, 80 and 90%), data preprocessing (i.e. with or without near-zero variance removal), and learning models (i.e. BLR, ERT, RF, SGB, SVM and XGB). The performance metrics included accuracy, Cohen's kappa, F1-score, areas under the receiver operating characteristic curve, the precision recall curve and the precision recall gain curve, and execution time.
RESULTS: The average testing accuracies from accessory genes and pan kmers were significantly higher than those from core alleles or SNPs. While the accuracies from 70 and 80% of training dataset splitting were not significantly different from each other, those from 80% were significantly higher than those from the other tested proportions. Near-zero variance removal prevented results from being produced for 7-locus alleles, did not significantly affect accuracy for core alleles, accessory genes and pan kmers, and significantly decreased accuracy for core SNPs. The SVM and XGB models did not differ significantly in accuracy from each other and reached significantly higher accuracies than BLR, SGB, ERT and RF, in that order. However, the SVM model required more computing power than the XGB model, especially for large numbers of descriptors such as core SNPs and pan kmers.
CONCLUSIONS: In addition to recommendations about machine learning practices for L. monocytogenes source attribution based on genomic data, the present study also provides a freely available workflow to solve other balanced or unbalanced multiclass phenotypes from binary and categorical genomic profiles of other microorganisms without source code modifications.
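
As a rough illustration of three steps named in the abstract (training dataset splitting, near-zero variance removal, and scoring with accuracy and Cohen's kappa), the following standard-library Python sketch may help. It is not the authors' workflow: the function names, the stratified splitting strategy, and the near-zero variance thresholds (a frequency ratio above 95/5 combined with at most 10% unique values) are assumptions chosen for illustration only.

```python
import random
from collections import Counter, defaultdict


def stratified_holdout(labels, train_fraction, seed=0):
    """Split sample indices into training and testing parts, class by class.

    Hypothetical helper: stratification is an assumption, not a detail
    taken from the abstract.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label][:]
        rng.shuffle(indices)
        cut = round(len(indices) * train_fraction)
        train_idx.extend(indices[:cut])
        test_idx.extend(indices[cut:])
    return sorted(train_idx), sorted(test_idx)


def is_near_zero_variance(column, freq_cut=95 / 5, unique_cut=10.0):
    """Flag a descriptor whose values are almost constant.

    Thresholds are assumed defaults, not values reported by the study.
    """
    counts = Counter(column).most_common()
    if len(counts) == 1:
        return True  # a constant descriptor carries no information
    freq_ratio = counts[0][1] / counts[1][1]  # most common vs second most common
    percent_unique = 100.0 * len(counts) / len(column)
    return freq_ratio > freq_cut and percent_unique <= unique_cut


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the known source."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def cohen_kappa(y_true, y_pred):
    """Agreement corrected for chance; undefined if chance agreement is 1."""
    n = len(y_true)
    observed = accuracy(y_true, y_pred)
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    return (observed - expected) / (1 - expected)
```

For example, with hypothetical source labels such as `["dairy"] * 10 + ["meat"] * 20` and `train_fraction=0.8`, `stratified_holdout` keeps 8 dairy and 16 meat samples for training, so both sources stay represented in the testing part. Unlike accuracy, Cohen's kappa discounts the agreement expected by chance, which matters when source classes are unbalanced.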

Grants

  1. IZS AM 03/21 RC/Italian Ministry of Health

MeSH Term

Listeria monocytogenes
Genomics
Supervised Machine Learning
Machine Learning
Alleles
