SARS-CoV-2 virus classification based on stacked sparse autoencoder.

Maria G F Coutinho, Gabriel B M Câmara, Raquel de M Barbosa, Marcelo A C Fernandes
Author Information
  1. Maria G F Coutinho: Laboratory of Machine Learning and Intelligent Instrumentation, IMD/nPITI, Federal University of Rio Grande do Norte, Natal, Brazil.
  2. Gabriel B M Câmara: Laboratory of Machine Learning and Intelligent Instrumentation, IMD/nPITI, Federal University of Rio Grande do Norte, Natal, Brazil.
  3. Raquel de M Barbosa: Department of Pharmacy and Pharmaceutical Technology, University of Granada, 18071 Granada, Spain.
  4. Marcelo A C Fernandes: Laboratory of Machine Learning and Intelligent Instrumentation, IMD/nPITI, Federal University of Rio Grande do Norte, Natal, Brazil.

Abstract

Since December 2019, the world has been intensely affected by the COVID-19 pandemic, caused by SARS-CoV-2. When a novel virus is identified, early elucidation of the taxonomic classification and origin of its genomic sequence is essential for strategic planning, containment, and treatment. Deep learning techniques have been used successfully in many viral classification problems associated with viral infection diagnosis, metagenomics, phylogenetics, and analysis. With that motivation, we propose an efficient viral genome classifier for SARS-CoV-2 using a deep neural network based on the stacked sparse autoencoder (SSAE). To improve model performance, we explored image representations of the complete genome sequences as the SSAE input, using a dataset based on a k-mer image representation. We performed four experiments to provide different levels of taxonomic classification of SARS-CoV-2. The SSAE achieved classification accuracy between 92% and 100% on the validation set and between 98.9% and 100% when the SARS-CoV-2 samples were used as the test set. SARS-CoV-2 samples were not used during training, only during subsequent tests, in which the model inferred the correct classification in the vast majority of cases. This suggests that our model can be adapted to classify other emerging viruses. Finally, the results indicate the applicability of this deep learning technique to genome classification problems.
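The abstract does not spell out how the k-mer image is built, so the following is only a minimal sketch of one common construction: count every k-mer in a genome sequence and arrange the normalized counts in a square grid. The function name `kmer_image`, the lexicographic row-major layout, the choice of k, and the max-normalization are all assumptions made for illustration, not details taken from the paper.

```python
from itertools import product

import numpy as np


def kmer_image(sequence: str, k: int = 4) -> np.ndarray:
    """Return a (2**k, 2**k) array of max-normalized k-mer counts.

    Hypothetical layout: k-mers over ACGT are indexed in lexicographic
    order and the count vector is reshaped row-major into a square image.
    """
    alphabet = "ACGT"
    # Map each possible k-mer to a position in the flat count vector.
    index = {"".join(p): i for i, p in enumerate(product(alphabet, repeat=k))}
    counts = np.zeros(len(index), dtype=np.float64)

    seq = sequence.upper()
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])
        if j is not None:  # skip k-mers containing ambiguous bases such as N
            counts[j] += 1

    # Scale to [0, 1] so genomes of different lengths are comparable.
    if counts.max() > 0:
        counts /= counts.max()

    side = 2 ** k  # 4**k k-mers fit exactly in a (2**k) x (2**k) grid
    return counts.reshape(side, side)


image = kmer_image("ACGTACGT", k=2)
features = image.ravel()  # flat vector, e.g. as input to an autoencoder
```

Under these assumptions, the flattened image would serve as the fixed-length input vector of the autoencoder regardless of the genome length.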
