Zero-shot-capable identification of phage-host relationships with whole-genome sequence representation by contrastive learning.

Yao-Zhong Zhang, Yunjie Liu, Zeheng Bai, Kosuke Fujimoto, Satoshi Uematsu, Seiya Imoto
Author Information
  1. Yao-Zhong Zhang: Division of Health Medical Intelligence, Human Genome Center, The Institute of Medical Science, The University of Tokyo, Shirokanedai 4-6-1, Minato-ku, 108-8639 Tokyo, Japan.
  2. Yunjie Liu: Division of Health Medical Intelligence, Human Genome Center, The Institute of Medical Science, The University of Tokyo, Shirokanedai 4-6-1, Minato-ku, 108-8639 Tokyo, Japan.
  3. Zeheng Bai: Division of Health Medical Intelligence, Human Genome Center, The Institute of Medical Science, The University of Tokyo, Shirokanedai 4-6-1, Minato-ku, 108-8639 Tokyo, Japan.
  4. Kosuke Fujimoto: Department of Immunology and Genomics, Graduate School of Medicine, Osaka Metropolitan University, Asahi-machi 1-4-3, Abeno-ku, 545-8585 Osaka, Japan.
  5. Satoshi Uematsu: Department of Immunology and Genomics, Graduate School of Medicine, Osaka Metropolitan University, Asahi-machi 1-4-3, Abeno-ku, 545-8585 Osaka, Japan.
  6. Seiya Imoto: Division of Health Medical Intelligence, Human Genome Center, The Institute of Medical Science, The University of Tokyo, Shirokanedai 4-6-1, Minato-ku, 108-8639 Tokyo, Japan.

Abstract

Accurately identifying phage-host relationships from genome sequences remains challenging, especially for phages and hosts that share little sequence homology. Focusing on phage-host relationships at the species and genus levels, we propose a contrastive-learning-based approach that learns whole-genome sequence embeddings accounting for phage-host interactions (PHIs). Contrastive learning is used to place phages that infect the same host close to each other in the learned representation space. Specifically, we encode whole-genome sequences with frequency chaos game representation (FCGR) and learn latent embeddings that 'encapsulate' phage-host relationships through contrastive learning. The contrastive learning method works well on the imbalanced dataset. Based on the learned embeddings, the proposed pipeline, named CL4PHI, can predict both known hosts and hosts unseen during training. We compare our method with two recently proposed state-of-the-art learning-based methods on their benchmark datasets. The experimental results show that the proposed method improves prediction accuracy on known hosts and demonstrates zero-shot prediction capability on unseen hosts. As for potential applications, the rapid pace of genome sequencing across species has produced a vast amount of whole-genome sequencing data, which requires efficient computational methods for identifying phage-host interactions. The proposed approach is expected to address this need by efficiently processing whole-genome sequences of phages and prokaryotic hosts and by capturing features related to phage-host relationships in the genome sequence representation. It can be used to accelerate the discovery of phage-host interactions and to aid the development of phage-based therapies for infectious diseases.
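
The two ingredients named in the abstract, FCGR encoding and a contrastive objective, can be illustrated with a minimal sketch. This is not the CL4PHI implementation: the function names `fcgr` and `contrastive_loss`, the corner assignment of the four bases, the default `k`, and the margin-based pairwise loss are all assumptions chosen for illustration, and the abstract does not specify the network or the exact loss used.

```python
import numpy as np

# Assumed corner convention for the chaos game square; other conventions exist.
CORNERS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}


def fcgr(seq, k=3):
    """Return a 2^k x 2^k matrix of normalized k-mer frequencies.

    Each k-mer is mapped to one cell by reading one row bit and one
    column bit per base. k-mers containing non-ACGT symbols are skipped.
    """
    size = 2 ** k
    counts = np.zeros((size, size), dtype=np.float64)
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if any(base not in CORNERS for base in kmer):
            continue
        row = col = 0
        for base in kmer:
            r, c = CORNERS[base]
            row = (row << 1) | r
            col = (col << 1) | c
        counts[row, col] += 1
    total = counts.sum()
    return counts / total if total > 0 else counts


def contrastive_loss(emb_a, emb_b, is_pair, margin=1.0):
    """Generic margin-based pairwise contrastive loss (illustrative only).

    Pairs labeled as related are pulled together; unrelated pairs are
    pushed apart until their distance is at least `margin`.
    """
    d = float(np.linalg.norm(emb_a - emb_b))
    if is_pair:
        return d ** 2
    return max(0.0, margin - d) ** 2


if __name__ == "__main__":
    matrix = fcgr("ACGTACGT", k=2)
    print(matrix.shape)
    print(contrastive_loss(matrix.ravel(), matrix.ravel(), is_pair=True))
```

In a full pipeline the FCGR matrices would be passed through a learned encoder before the loss is applied; here the flattened matrix stands in for an embedding only to keep the sketch self-contained.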

MeSH Term

Bacteriophages
Genome, Viral
Whole Genome Sequencing
Chromosome Mapping
