Machine Learning Early Detection of SARS-CoV-2 High-Risk Variants.

Lun Li, Cuiping Li, Na Li, Dong Zou, Wenming Zhao, Hong Luo, Yongbiao Xue, Zhang Zhang, Yiming Bao, Shuhui Song
Author Information
  1. Lun Li: China National Center for Bioinformation, Beijing, 100101, China.
  2. Cuiping Li: China National Center for Bioinformation, Beijing, 100101, China.
  3. Na Li: China National Center for Bioinformation, Beijing, 100101, China.
  4. Dong Zou: China National Center for Bioinformation, Beijing, 100101, China.
  5. Wenming Zhao: China National Center for Bioinformation, Beijing, 100101, China.
  6. Hong Luo: China National Center for Bioinformation, Beijing, 100101, China.
  7. Yongbiao Xue: China National Center for Bioinformation, Beijing, 100101, China.
  8. Zhang Zhang: China National Center for Bioinformation, Beijing, 100101, China.
  9. Yiming Bao: China National Center for Bioinformation, Beijing, 100101, China.
  10. Shuhui Song: China National Center for Bioinformation, Beijing, 100101, China.

Abstract

Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has evolved many high-risk variants, resulting in repeated COVID-19 waves over the past years. Accurate early warning of high-risk variants is therefore vital for epidemic prevention and control. However, detecting high-risk variants through experimental and epidemiological research is time-consuming and often lags behind the emergence and spread of these variants. In this study, HiRisk-Detector, a machine learning algorithm based on haplotype networks, is developed for the early computational detection of high-risk SARS-CoV-2 variants. Its effectiveness, robustness, and generalizability are validated using over 7.6 million high-quality, complete SARS-CoV-2 genomes and their metadata. First, HiRisk-Detector is evaluated on empirical data, where it detects all 13 high-risk variants and precedes the World Health Organization announcements by 27 days on average. Second, its robustness is tested by reducing sequencing intensity to one-fourth, which delays detection by only 3.8 days. Third, HiRisk-Detector is applied to detect risk among SARS-CoV-2 Omicron sub-lineages, confirming its broad applicability with high ROC-AUC and PR-AUC performance. Overall, HiRisk-Detector shows a strong capacity for early detection of high-risk variants and may be useful in public health emergencies caused by infectious diseases or viruses.
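
The two headline evaluations in the abstract, average lead time over official announcements and robustness under reduced sequencing intensity, can be illustrated with a minimal sketch. This is not the HiRisk-Detector implementation: the function names `mean_lead_days` and `subsample`, the variant labels, and all dates below are hypothetical placeholders, and uniform random subsampling is only one assumed way to mimic a one-fourth sequencing intensity.

```python
from datetime import date
import random


def mean_lead_days(detected, announced):
    """Average number of days by which detection precedes announcement.

    Both arguments map a variant label to a date; only variants present
    in both mappings are counted.
    """
    leads = [(announced[v] - detected[v]).days for v in announced if v in detected]
    return sum(leads) / len(leads)


def subsample(records, fraction, seed=0):
    """Keep a random fraction of records to mimic lower sequencing intensity."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    k = max(1, round(len(records) * fraction))
    return rng.sample(records, k)


# Made-up dates for illustration only (not values from the study).
detected = {"variant_A": date(2021, 1, 1), "variant_B": date(2021, 6, 10)}
announced = {"variant_A": date(2021, 1, 31), "variant_B": date(2021, 6, 30)}
lead = mean_lead_days(detected, announced)  # (30 + 20) / 2 = 25.0

# Stand-in for genome records; a real run would use sequences with metadata.
records = list(range(1000))
quarter = subsample(records, 0.25)  # 250 records remain
```

In such a setup, the robustness figure would come from rerunning detection on the subsampled records and comparing the resulting detection dates with those obtained from the full data.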

Grants

  1. 2023YFC3041500/Ministry of Science and Technology of the People's Republic of China
  2. ANSO-CR-KP-2022-09/Alliance of National and International Science Organizations for the Belt and Road Regions
  3. Z211100002121006/Beijing Municipal Science & Technology Commission, Administrative Commission of Zhongguancun Science Park
  4. 2021YFF0703703/National Key Research & Development Program of China
  5. 2023YFC2604400/National Key Research & Development Program of China
  6. XDB38030200/Chinese Academy of Sciences
  7. Y2021038/Chinese Academy of Sciences
  8. 32170678/National Natural Science Foundation of China
  9. 32270718/National Natural Science Foundation of China

MeSH Terms

SARS-CoV-2
Humans
COVID-19
Machine Learning
Genome, Viral
Algorithms
